I’ve made another blog post that I believe should give you a better introduction to machine learning. Please take a look at REDO: Intro to Machine Learning to use K-nearest neighbor
What is Machine Learning?
Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on models and inference instead.
Before we can get started we need to define the problem we are going to solve. Let us say you work in car insurance and your boss asks…
If it hails 29 times this year in San Francisco, how much will the damages cost?
Identify the data that is relevant to the problem.
Well, let’s break down the problem…
If the number of times it hails changes, we will probably see a change in the damage cost.
Meaning if it hails 15 times in San Francisco this next year, then the damage would be a lot less than if it were to hail 32 times.
The independent variable is the number of times it hails per year, and the damage cost is the dependent variable.
(independent variable) hail per year++ === (dependent variable) damage cost++
(independent variable) hail per year-- === (dependent variable) damage cost--
These variables have special names in Machine Learning.
A feature is the independent variable. The exact definition from Wikipedia:
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed.
To learn more about features please visit the wiki: Machine Learning Features
A label is the dependent variable. This is the value we expect to predict; in our case it is the damage cost.
Another example: you have to categorize your friends by their height and weight into groups such as FIT and NoN-FIT. Here the height and weight of each person are the features, and the group names (FIT, NoN-FIT) are the labels.
To learn more about labels please visit the Quora question: Machine Learning Label
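Here is a minimal sketch of how features and labels can be separated in code, using the friends example above. The field names (height_cm, weight_kg, group), the numbers, and the split_features_labels helper are all made up for illustration; they do not come from any library or real dataset.

```python
# Hypothetical records: every value here is invented for illustration.
friends = [
    {"name": "Ana", "height_cm": 180, "weight_kg": 75, "group": "FIT"},
    {"name": "Ben", "height_cm": 165, "weight_kg": 95, "group": "NoN-FIT"},
    {"name": "Cy", "height_cm": 172, "weight_kg": 68, "group": "FIT"},
]


def split_features_labels(records, feature_names, label_name):
    """Return (features, labels): one feature row and one label per record."""
    features = [[record[name] for name in feature_names] for record in records]
    labels = [record[label_name] for record in records]
    return features, labels


features, labels = split_features_labels(
    friends, ["height_cm", "weight_kg"], "group"
)
print(features)  # the independent variables, one row per friend
print(labels)    # the dependent variable we would like to predict
```

Keeping features and labels in two parallel lists like this is a common convention: row i of the features lines up with entry i of the labels.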
Assemble the DATA
Almost always we are going to have to combine multiple sources of data to get the exact information we are looking for. For software engineers this can mean making multiple API requests, reading from multiple tables in a relational database, or pulling multiple sections of objects out of a NoSQL database. There are more ways to gather information, but the important thing is to shape the data into a data structure that will work with your machine learning program.
The next part will be to research your data sources. Your research can include many different sources to obtain the right information. Data points can come from a wide variety of information, such as APIs, spreadsheets, web scraping, and so on.
Research proper features that support your label.
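To make the assembling step a little more concrete, here is a rough sketch that joins two sources by year into one list of rows. Both sources are assumptions: hail_counts stands in for something like a weather API response, claims stands in for rows from a database table, and every number is invented. The assemble function is a hypothetical helper, not part of any library.

```python
# Hypothetical source 1: hail events per year (for example, from a weather API).
hail_counts = {2014: 12, 2015: 20, 2016: 8, 2017: 25}

# Hypothetical source 2: individual claims (for example, from a database table).
claims = [
    {"year": 2014, "damage_usd": 30000},
    {"year": 2014, "damage_usd": 32000},
    {"year": 2015, "damage_usd": 101000},
    {"year": 2016, "damage_usd": 41000},
    {"year": 2018, "damage_usd": 5000},  # no hail count exists for this year
]


def assemble(hail_counts, claims):
    """Join both sources by year, keeping only years present in both."""
    # Add up all claims for each year to get one damage total per year.
    damage_by_year = {}
    for claim in claims:
        year = claim["year"]
        damage_by_year[year] = damage_by_year.get(year, 0) + claim["damage_usd"]

    rows = []
    for year in sorted(hail_counts):
        if year in damage_by_year:  # skip years with a missing label
            rows.append({
                "year": year,
                "hail_per_year": hail_counts[year],   # feature
                "damage_usd": damage_by_year[year],   # label
            })
    return rows


dataset = assemble(hail_counts, claims)
for row in dataset:
    print(row)
```

Dropping years that are missing from either source is only one possible cleanup choice; depending on the problem you might fill the gaps in some other way.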
Decide what kind of output you want to predict.
After you have collected your data sources for both your label and features, it is time to decide what kind of output you would like to predict. There are many different outputs to choose from, but for the sake of this tutorial we will discuss two common types of output: Classification and Regression.
Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function from input variables (features) to discrete output variables (labels).
The value of our labels belongs to a discrete set. This means only a few outputs are available; think of it as true and false.
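As a small illustration of classification, here is a sketch of k-nearest neighbors: a new point gets the label that is most common among the k closest training points. The training data (height in cm, weight in kg) is invented, and knn_classify is a hypothetical helper written for this post. A real project would normally scale the features first and would likely use a library implementation.

```python
import math
from collections import Counter


def knn_classify(train_features, train_labels, point, k=3):
    """Predict a label by majority vote among the k closest training points."""
    # Sort the training examples by straight-line distance to the new point.
    by_distance = sorted(
        zip(train_features, train_labels),
        key=lambda pair: math.dist(pair[0], point),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbors wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented training data: [height_cm, weight_kg] with a discrete label.
train_features = [[180, 75], [175, 70], [170, 68], [165, 95], [172, 105], [160, 90]]
train_labels = ["FIT", "FIT", "FIT", "NoN-FIT", "NoN-FIT", "NoN-FIT"]

print(knn_classify(train_features, train_labels, [178, 72]))
print(knn_classify(train_features, train_labels, [168, 98]))
```

Notice that the output can only ever be one of the labels seen in training, which is exactly what "discrete set" means here.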
Regression predictive modeling is the task of approximating a mapping function from input variables to a continuous output variable. A continuous output variable is a real-value, such as an integer or floating point value. These are often quantities, such as amounts and sizes. For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.
The value of our labels belongs to a continuous set. This means the output will be in a range of values rather than a discrete set.
Fundamentally, classification is about predicting a label and regression is about predicting a quantity.
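For the regression side, here is a minimal sketch that fits a straight line to the hail problem with ordinary least squares and then answers the boss's question for 29 hail events. The yearly numbers are invented, and a straight line is only an assumption about how hail and damage relate; fit_line is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: hail events per year (feature) and damage in USD (label).
hail_per_year = [8, 12, 20, 25]
damage_usd = [40000, 60000, 100000, 125000]

slope, intercept = fit_line(hail_per_year, damage_usd)
predicted_damage = slope * 29 + intercept
print(f"Predicted damage for 29 hail events: ${predicted_damage:,.0f}")
```

The made-up data sits exactly on the line damage = 5,000 x hail, so the prediction for 29 hail events comes out at 145,000. Real data would be noisy, and the output would still be a quantity on a continuous range rather than one of a few classes.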
1. Identify the features and labels
Don’t forget that “Features” are categories of data points that affect the value of the “Label”
2. Assemble a set of data related to the problem you are trying to solve
Datasets almost always need to be cleaned up for formatting
3. Decide on the type of output you are predicting
Regression is used with continuous values; classification is used with discrete values.
4. Based on the type of output, pick an algorithm that will determine a correlation between your “Features” and “Labels”
There are many different algorithms, and each has its pros and cons. Research a few algorithms and create a model with each one; this way you can see which algorithm gives the best performance and predictions.
5. Use the model generated by the algorithm to make a prediction
Models relate the value of “Features” to the value of “Labels”
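The five steps above can be sketched end to end for the hail problem. This is only an illustration under assumptions: the training and held-out numbers are invented, the two "algorithms" (always predict the average, and a least-squares straight line) were picked because they are easy to read, and mean absolute error is just one possible way to compare them.

```python
def fit_mean(xs, ys):
    """Baseline algorithm: ignore the feature and always predict the average label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Least-squares straight line through the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_absolute_error(model, xs, ys):
    """Average size of the prediction error on the given examples."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Steps 1 and 2: feature = hail per year, label = damage in USD (invented data).
train_x, train_y = [8, 12, 20, 25], [41000, 62000, 101000, 124000]
held_out_x, held_out_y = [10, 22], [52000, 109000]

# Step 3: the label is a continuous amount, so this is a regression problem.
# Step 4: build a model with each algorithm and compare them on held-out data.
models = {
    "mean": fit_mean(train_x, train_y),
    "linear": fit_linear(train_x, train_y),
}
errors = {
    name: mean_absolute_error(model, held_out_x, held_out_y)
    for name, model in models.items()
}
best = min(errors, key=errors.get)

# Step 5: use the best model to make a prediction.
prediction = models[best](29)
print(errors)
print(best, round(prediction))
```

Comparing on data the models did not see during fitting is what makes the comparison meaningful; a model that only does well on its own training data tells you very little.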