“Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type — quietly but meaningfully improving core operations.” ~Jeff Bezos.
Table of contents
- What is Machine Learning?
- Types of Machine Learning
- How to solve a machine learning problem?
What is Machine Learning?
Machine learning is a way of using data to train a computer model so that it can predict new data values based on known inputs.
The model learns from training cases, and we can then use that trained model to make predictions for new data cases. So, we start with a data set that contains historical records, often called cases or observations. Each observation includes numeric features that quantify a characteristic of the item that we’re working with. So, let’s call this X. In general, we also have some value that we’re trying to predict; let’s call this Y.
We use our training cases to train a machine learning model so that it can calculate a value for Y from the features in X. So, in very simplistic terms, what we’re doing is creating a function that operates on a set of features X, to produce predictions Y.
Types of Machine Learning
Generally speaking, there are two broad kinds of machine learning:
- Supervised learning and
- Unsupervised learning.
In supervised learning scenarios, we start with observations that include known values for the variable we want to predict. We call these labels. Then we use a machine learning algorithm to train a model that fits the features to the known label. Because we started with a known label value, we can validate the model by comparing the value predicted to the actual label value that we had in the first place; then when we’re happy that the model works we can use it with new observations for which the label is unknown and generate new predicted values.
Supervised learning problems are further divided into two major types based on the target variable (the label):
- Regression problem: when the target variable is continuous.
- Classification problem: when the target variable is categorical.
Suppose a retailer has historic sales data and wants to use machine learning to predict the number of sales on any given day, so that the right amount of inventory can be kept ready. When we need to predict a numeric value like this, we can use a supervised learning technique called regression. For example, suppose the retailer has gathered historic sales data that includes the customer profile, product information, locality information, seasonal events, weather information, and the number of sales. What we want to do is model the number of sales using the features that the retailer has recorded.
Take a single day on which the retailer recorded 1000 sales. We know all of the feature values for that day, and we know the label value of 1000, so we need our algorithm to learn a function that operates on all of the features to give a resulting label of 1000. Of course, a sample of only one day isn’t likely to give us a function that generalizes well, so we’d gather the same data from lots of days and train our model on this larger set of data. After we’ve trained our model, we’ll have a generalized function that can be used to calculate our label Y for any given vector of feature values X.
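As a minimal sketch of what "learning a function from many days" can look like, the example below fits a straight line by ordinary least squares. It assumes a single feature, daily temperature, and invented sales figures. The names `fit_line`, `predict`, `temperatures`, and `sales` are hypothetical, and a real model would use many more features and a library implementation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training cases: one feature (temperature) and one label (sales).
temperatures = [10, 15, 20, 25, 30]
sales = [700, 850, 1000, 1150, 1300]

slope, intercept = fit_line(temperatures, sales)


def predict(temperature):
    """The learned function: maps a feature value X to a predicted label Y."""
    return slope * temperature + intercept


print(predict(22))  # predicted sales for a new, unseen day
```

The invented data lie exactly on a line, so the fit recovers it. Real sales data are noisy, and the fitted function only approximates the labels.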
Because we started with data that includes the label we’re trying to predict, we can train the model using only some of the data and withhold the rest to evaluate the model’s performance. We use the model to predict Y from the features X of the evaluation data, and then compare the predictions, or scored labels, to the actual labels that we know to be true. The differences between the predicted and actual labels are the errors, or residuals, of the model. Squaring the residuals, averaging them, and taking the square root gives the root mean squared error, or RMSE. It is an absolute measure of the typical size of the error, expressed in the same units as the label, and it can be read much like a standard deviation of the residuals. For example, an RMSE value of 3 means that the typical error on our test data is about three sales.
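The hold-out idea and the RMSE calculation can be sketched as follows. This is an illustration using only the standard library. The helper names `train_test_split` and `rmse`, the 30% test fraction, and the sales figures are assumptions made for the example.

```python
import math
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and withhold a fraction for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root mean squared error: square the residuals, average, take the root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical withheld days: actual labels and the model's scored labels.
actual_sales = [1000, 950, 1100]
predicted_sales = [1003, 947, 1097]

error = rmse(actual_sales, predicted_sales)
print(error)  # every prediction is off by exactly 3 sales, so RMSE is 3.0
```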
Another metric that we can use to evaluate a regression model is the mean absolute error, or MAE, which is the average of the absolute values of the errors. In effect, MAE is the average magnitude of the model’s errors. Relative absolute error is the total absolute error of the model divided by the total absolute error of a baseline that always predicts the mean value of the label. Because this is a relative metric, it’s not measured in sales. Zero indicates a perfect model, and values below one indicate a model that beats the mean baseline. Relative metrics like this can be a useful way to evaluate the performance of a model in general terms. Another relative metric is relative squared error, which is the sum of the squared errors divided by the sum of the squared differences between each label and the mean label. Finally, the coefficient of determination, also known as the R-squared of the model, is one minus the relative squared error and represents the predictive power of the model. A perfect fit results in a value of 1. A value of 0 means that the model does no better than always predicting the mean label, and a model that does worse than that has a negative value.
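The sketch below computes these regression metrics under their usual definitions, in which each relative metric compares the model’s error with that of a baseline that always predicts the mean label. The function name `regression_metrics` and the three data points are invented for illustration.

```python
def regression_metrics(actual, predicted):
    """Return MAE, relative absolute error, relative squared error and R-squared."""
    n = len(actual)
    mean_actual = sum(actual) / n

    # Errors of the model itself.
    abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))

    # Errors of the baseline that always predicts the mean label.
    abs_baseline = sum(abs(a - mean_actual) for a in actual)
    sq_baseline = sum((a - mean_actual) ** 2 for a in actual)

    rse = sq_error / sq_baseline
    return {
        "mae": abs_error / n,
        "rae": abs_error / abs_baseline,
        "rse": rse,
        "r2": 1 - rse,
    }


# Hypothetical actual and predicted labels.
metrics = regression_metrics([10, 20, 30], [12, 18, 33])
print(metrics)
```

With these numbers the absolute errors are 2, 2 and 3, so the MAE is 7/3, the relative absolute error is 7/20 = 0.35, the relative squared error is 17/200 = 0.085, and R-squared is 0.915.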
So, we’ve seen how to train a regression model to predict a numeric value. Now it’s time to look at another kind of supervised learning: classification. Classification is a technique we can use to predict which class, or category, something belongs to. The simplest variant of this is binary classification, where we predict whether an entity belongs to one of two classes; and often that’s used to determine if something is true or false about the entity.
For example, suppose the retailer keeps records of which days were profitable and which weren’t, along with the same kind of features for each day. We can train a machine learning model that can be applied to the daily features and gives the result 1 for days that are profitable and 0 for days that aren’t. More generally, a binary classifier is a function that can be applied to features X to produce a Y value of 1 or 0. The function won’t actually calculate an absolute value of 1 or 0. It calculates a value between 0 and 1, and we use a threshold value to decide whether the result should be counted as a 1 or a 0. When you use the model to predict values, the resulting value is classed as a 1 or 0 depending on which side of the threshold it falls.
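To make the thresholding step concrete, here is a small sketch. It assumes a logistic scoring function, which is one common way of producing a value between 0 and 1, though the text does not prescribe it. The weights, bias, and feature values are illustrative rather than learned, and `score` and `classify` are hypothetical names.

```python
import math


def score(features, weights, bias):
    """Hypothetical logistic scoring function: returns a value between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def classify(value, threshold=0.5):
    """Turn a score into a class: 1 at or above the threshold, otherwise 0."""
    return 1 if value >= threshold else 0


# Illustrative parameters, not learned from any data.
weights, bias = [0.8, -0.5], 0.0
day_features = [1.0, 1.0]

probability = score(day_features, weights, bias)
print(probability, classify(probability))

# The same score can land in a different class when the threshold moves.
print(classify(0.45, threshold=0.5), classify(0.45, threshold=0.3))
```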
Because classification is a supervised learning technique, we withhold some of our data as test data and use its known labels to validate the model. Cases where the model predicts a 1 for a test observation that actually has a label of 1 are considered true positives. Similarly, cases where the model predicts 0 and the actual label is 0 are true negatives. If the model predicts 1 but the actual label is 0, that’s a false positive; and if the model predicts 0 but the actual label is 1, that’s a false negative.
The choice of threshold determines how predictions are assigned to classes. In some cases, a predicted value may be very close to the threshold but still misclassified. You can move the threshold to control how the predicted values are classified. In some scenarios, it might be better to accept more false positives in order to reduce the number of false negatives; in other scenarios the opposite might be better. Either way, the number of true positives, false positives, true negatives, and false negatives produced by your model is crucial in evaluating its effectiveness. These counts are often shown in what’s called a confusion matrix, which provides the basis for calculating performance metrics for a classifier.
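Counting the four outcomes is straightforward, as the sketch below suggests. The label lists are invented and arranged to give the five true positives, two false positives, four true negatives, and no false negatives used in the accuracy example that follows. The function name `confusion_counts` is an assumption.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
    }


# Hypothetical test set of eleven days: 5 profitable, 6 not profitable.
actual_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

counts = confusion_counts(actual_labels, predicted_labels)
print(counts)
```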
The simplest metric is accuracy, which is just the number of correctly classified cases divided by the total number of cases. Suppose, for example, that a test set of eleven days yields five true positives, four true negatives, two false positives, and no false negatives. That gives us nine correct predictions out of eleven, which is an accuracy of 0.82, or 82%. That might seem like a really good result but, perhaps surprisingly, accuracy isn’t all that useful as a measure of a model’s performance. Suppose, for example, that only three percent of days are profitable. I could create a model that simply always predicts zero, and it would be 97% accurate, but it would be completely useless for identifying which days will be profitable.
A more useful metric might be the fraction of cases classified as positive that are actually positive. This metric is known as precision: out of all of the cases that we classified as positive, how many are in fact positive? In this case there are five true positives and two false positives, so our precision is five out of seven, which is 0.71. So 71% of cases identified as positive really are profitable days, and 29% are misclassified.
In some situations, we might want a metric that’s sensitive to the fraction of positive cases we correctly identify, and we call it recall. It’s calculated as the number of true positives divided by the combined number of true positives and false negatives; in other words, what fraction of positive cases are correctly identified. In this case there are five true positives and no false negatives, so our recall is five out of five, which is of course 1, or 100%. So in this case we’re correctly identifying all of the profitable days.
Recall has another name: it’s sometimes known as the true positive rate. There’s an equivalent rate for false positives, calculated as the number of false positives divided by the actual number of negatives.
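The arithmetic above can be checked with a short sketch. The counts come from the example in the text. The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here for the always-predict-zero model.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall (true positive rate) and false positive rate."""

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the metric is undefined.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Counts from the worked example: 5 TP, 2 FP, 4 TN, 0 FN.
example = classification_metrics(tp=5, fp=2, tn=4, fn=0)
print(example)

# A model that always predicts 0 when 3 of 100 days are profitable.
always_zero = classification_metrics(tp=0, fp=0, tn=97, fn=3)
print(always_zero)
```

The second call shows why accuracy alone can mislead: accuracy is 0.97, yet recall is 0 because no profitable day is ever identified.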
In this case we have two false positives and four true negatives, so our false positive rate is 2 out of 6, which is 0.33. The metrics so far are based on a threshold of around 0.3, and we can plot the true positive rate against the false positive rate for that threshold as a single point. If we move the threshold to 0.5, our true positive rate becomes 4 out of 5, or 0.8, and the false positive rate becomes 1 out of 6, or 0.167, which gives a second point. Moving the threshold further to, say, 0.7 gives a true positive rate of two out of five, or 0.4, and a false positive rate of zero out of six, or 0.
If we plotted these points for every possible threshold, we’d end up with a curved line known as a receiver operating characteristic, or ROC, chart. The area under this curve, or AUC, is an indication of how well the model predicts. Generally, you want to see a large AUC, with the curve staying close to the top left corner of the chart. A perfect classifier would go straight up the left-hand side and then along the top, giving an AUC of 1. You can compare the curve with a diagonal line, which represents how well the model would perform if you simply made a 50/50 guess and which has an AUC of 0.5. In this case the model has an AUC of around 0.9, so it definitely outperforms guessing.
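A rough sketch of how the ROC points and the AUC can be computed is shown below. The eleven scores are invented so that thresholds of 0.3, 0.5, and 0.7 reproduce the rates quoted above; they are not the author’s data. The helper names `rates_at` and `roc_auc` are hypothetical, and the area is approximated with the trapezoid rule.

```python
def rates_at(threshold, labels, scores):
    """Return (false positive rate, true positive rate) at a given threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def roc_auc(labels, scores):
    """Area under the ROC curve, sweeping the threshold over every score."""
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)] + [rates_at(t, labels, scores) for t in thresholds]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbouring points
    return area


# Hypothetical model scores: 5 profitable days (label 1), 6 unprofitable (label 0).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.65, 0.35, 0.25, 0.2, 0.15, 0.1]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, rates_at(threshold, labels, scores))

auc = roc_auc(labels, scores)
print(auc)
```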
There are various machine learning algorithms for solving the supervised problems above. The most popular ones are:
- Linear and logistic regression
- K-nearest neighbors
- Support vector machines
- Bayesian algorithms
- Decision trees
- Bagging models: Extra Trees classifier, Random Forest
- Boosting models: Gradient Boosting, AdaBoost, CatBoost, LightGBM, XGBoost (Extreme Gradient Boosting)
- Deep learning models: DNN, CNN, RNN, etc.
Unsupervised learning differs from supervised learning in that we don’t have a known label in the training data set. We train the model by finding similarities in the observations. After the model is trained, each new observation is assigned to a cluster of observations with similar characteristics. This grouping task is known as clustering. Other major categories of unsupervised learning include:
- Association learning
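The idea of grouping observations by similarity and then assigning new observations to the closest group can be sketched with a tiny one-dimensional k-means. K-means is one common clustering algorithm and is chosen here only for illustration. The daily sales values, the starting centroids, and the names `kmeans_1d` and `assign` are all assumptions.

```python
def assign(value, centroids):
    """Index of the centroid closest to the value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for one numeric feature and given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            clusters[assign(v, centroids)].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical unlabeled observations: daily sales with two natural groups.
daily_sales = [100, 110, 120, 900, 1000, 1100]
centroids = kmeans_1d(daily_sales, centroids=[100, 1100])
print(centroids)

# A new observation is assigned to the cluster with the nearest centroid.
print(assign(130, centroids), assign(950, centroids))
```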
How to solve a machine learning problem?
Below is the general flow for solving a machine learning problem:
- Data pre-processing
- Data exploration
- Exploratory data analysis
- Splitting the data for validation
- Feature engineering
- Deciding on a performance metric
- Evaluating different machine learning algorithms
- Testing the final algorithm on unseen data
- Deploying the solution
One or two steps may differ between supervised and unsupervised solutions, but the overall template remains the same.
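The later steps of this flow (splitting the data, choosing a metric, comparing algorithms, and testing the winner on unseen data) can be sketched end to end. This is a toy illustration. The (temperature, sales) rows are invented, the two candidate "algorithms" are a mean baseline and a least-squares line, MAE is the assumed metric, and every name is hypothetical. In practice the rows would be shuffled before splitting.

```python
def fit_mean(rows):
    """Baseline candidate: always predict the mean label of the training rows."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Second candidate: a straight line fitted by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mae(model, rows):
    """Chosen performance metric: mean absolute error."""
    return sum(abs(y - model(x)) for x, y in rows) / len(rows)


# Hypothetical (temperature, sales) rows, already in arbitrary order.
rows = [(20, 1000), (12, 755), (30, 1300), (15, 850), (25, 1150),
        (10, 705), (28, 1245), (22, 1055), (18, 945), (32, 1355)]

# Split the data: train, validation (model selection), test (unseen data).
train, validation, test = rows[:6], rows[6:8], rows[8:]

# Evaluate the candidate algorithms on the validation rows.
candidates = {"mean baseline": fit_mean, "least-squares line": fit_line}
fitted = {name: fit(train) for name, fit in candidates.items()}
validation_mae = {name: mae(model, validation) for name, model in fitted.items()}

# Keep the best candidate and test it once on the unseen rows.
best_name = min(validation_mae, key=validation_mae.get)
test_mae = mae(fitted[best_name], test)
print(best_name, test_mae)
```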
Series Name : ML1