ML
Machine Learning Week-3
Last updated
Machine Learning Week-3
Last updated
Our issues are Cross Validation, Model Tuning, Classification Models -1
Given a dataset, and an appropriate algorithm to train the dataset with, if the model fits the dataset rightly, it will be able to give accurate predictions on never-seen-before data.
On the other hand, if the Machine Learning model hasn’t been trained properly on the given data due to various reasons, the model will not be able to make accurate or nearly good predictions on the data.
This is because the model would have failed to capture the essential patterns from the data.
If a model that is being trained is stopped prematurely, it can lead to underfitting. This means data won’t be trained for the right amount of time, due to which it won’t be able to perform well with the new data. This would lead to the model not being able to give good results, and they could not be relied upon.
The dashed line in blue is the model that underfits the data. The black parabola is the line of data points that fits the model well.
This is just the opposite of underfitting. It means that instead of extracting the patterns or learning the data just right, it learns too much. This means all the data is basically captured, including noise (irrelevant data, that wouldn’t contribute to the prediction of output when new data is encountered) thereby leading to not being able to generalize the model to new data.
The model, during training, performs well, and learns all data points, literally memorizing the data that it has been given. But when it is in the testing phase or a new data point is introduced to it, it fails miserably. The new data point will not be captured by an overfit machine learning model.
Note: In general, more the data, better the training, leading to better prediction results. But it should also be made sure that the model is not just capturing all points, instead it is learning, thereby removing the noise present in the data.
Before exposing the model to the real world, the training data is divided into two parts. One is called the ‘training set’ and the other is known as the ‘test set’. Once the training is completed on the training dataset, the test set is exposed to the model to see how it behaves with newly encountered data. This gives a sufficient idea about how accurately the model can work with new data, and its accuracy.
You can see underfitting and overfitting from the link below.
Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance). There is a tradeoff between a model’s ability to minimize bias and variance. Gaining a proper understanding of these errors would help us not only to build accurate models but also to avoid the mistake of overfitting and underfitting.
So let’s start with the basics and see how they make difference to our machine learning Models.
What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
What is variance?
Variance is the variability of model prediction for a given data point or a value which tells us spread of our data. Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before. As a result, such models perform very well on training data but has high error rates on test data
The train-test split is a technique for evaluating the performance of a machine learning algorithm.
It can be used for classification or regression problems and can be used for any supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fit machine learning model.
The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.
This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.
The train-test procedure is appropriate when there is a sufficiently large dataset available.
When to Use the Train-Test Split The idea of “sufficiently large” is specific to each predictive modeling problem. It means that there is enough data to split the dataset into train and test datasets and each of the train and test datasets are suitable representations of the problem domain. This requires that the original dataset is also a suitable representation of the problem domain.
A suitable representation of the problem domain means that there are enough records to cover all common cases and most uncommon cases in the domain. This might mean combinations of input variables observed in practice. It might require thousands, hundreds of thousands, or millions of examples.
Conversely, the train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance. The estimated performance could be overly optimistic (good) or overly pessimistic (bad).
If you have insufficient data, then a suitable alternate model evaluation procedure would be the k-fold cross-validation procedure.
In addition to dataset size, another reason to use the train-test split evaluation procedure is computational efficiency.
Some models are very costly to train, and in that case, repeated evaluation used in other procedures is intractable. An example might be deep neural network models. In this case, the train-test procedure is commonly used.
Alternately, a project may have an efficient model and a vast dataset, although may require an estimate of model performance quickly. Again, the train-test split procedure is approached in this situation.
Samples from the original training dataset are split into the two subsets using random selection. This is to ensure that the train and test datasets are representative of the original dataset.
You can see how train-test splitting is done from the link below.
Our dataset should be as large as possible to train the model and removing considerable part of it for validation poses a problem of losing valuable portion of data that we would prefer to be able to train. In order to address this issue, we use the K-fold Cross validation technique.
In K Fold cross validation, the data is divided into k subsets and train our model on k-1 subsets and hold the last one for test. This process is repeated k times, such that each time, one of the k subsets is used as the test set/ validation set and the other k-1 subsets are put together to form a training set. We then average the model against each of the folds and then finalize our model. After that we test it against the test set. Below is the sample code performing k-fold cross validation on logistic regression.
Accuracy of our model is 77.673% and now let’s tune our hyperparameters. In the above code, I am using 5 folds. But, how do we know number of folds to use?
The more the number of folds, less is value of error due to bias but increasing the error due to variance will increase; the more folds you have, the longer it would take to compute it and you would need more memory. With a lower number of folds, we’re reducing the error due to variance, but the error due to bias would be bigger. It would also computationally cheaper. Therefore, in big datasets, k=3 is usually advised.
The k value must be chosen carefully for your data sample.
A poorly chosen value for k may result in a mis-representative idea of the skill of the model, such as a score with a high variance (that may change a lot based on the data used to fit the model), or a high bias, (such as an overestimate of the skill of the model).
Three common tactics for choosing a value for k are as follows:
Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias a modest variance.
k=n: The value for k is fixed to n, where n is the size of the dataset to give each test sample an opportunity to be used in the hold out dataset. This approach is called leave-one-out cross-validation
The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller
You can see how k-fold c.v is done from the link below.
Before diving deep into stratified cross-validation, it is important to know about stratified sampling. Stratified sampling is a sampling technique where the samples are selected in the same proportion (by dividing the population into groups called ‘strata’ based on a characteristic) as they appear in the population. For example, if the population of interest has 30% male and 70% female subjects, then we divide the population into two (‘male’ and ‘female’) groups and choose 30% of the sample from the ‘male’ group and ‘70%’ of the sample from the ‘female’ group.
You can see how stratified k-fold is done from the link below.
Hold-out method for model evaluation represents the mechanism of splitting the dataset into training and test dataset and evaluating the model performance in order to get the most optimal model. The following represents the hold-out method for model evaluation.
In above diagram, you may note that the data set is split into two parts. One split is set aside or hold out for training the model. Another set is set aside or hold out for testing or evaluating the model. The split percentage is decided based on the volume of the data available for training purpose. Generally, 70-30% split is used for splitting the dataset where 70% of dataset is used for training and 30% dataset is used for testing the model.
This technique is well suited if the goal is to compare the models based on the model accuracy on the test dataset and select the best model. However, there is always a possibility that trying to use this technique can result in the model fitting well to the test dataset. In other words, the models are trained in order to improve model accuracy on the test dataset assuming that the test dataset represents the population. The test error, thus, becomes optimistically biased estimation of generalization error. However, that is not correct. The final model fails to generalize well to the unseen or future dataset as it is trained to fit well (or overfit) with respect to the test data.
The following is the process of using hold-out method for model evaluation:
Split the dataset into two parts (preferably based on 70-30% split; However, the percentage split will vary)
Train the model on the training dataset; While training the model, some fixed set of hyper parameters is selected.
Test or evaluate the model on the held-out test dataset
Train the final model on the entire dataset to get a model which can generalize better on the unseen or future dataset.
Note that this process is used for model evaluation based on splitting the dataset into training and test dataset and using a fixed set of hyper parameters. There is another technique of splitting the data into three sets and use these three sets for model selection or hyper parameters tuning. We will look at that technique in next section.
Hold-out method for Model Selection
The hold-out method can also be used for model selection or hyper parameters tuning. As a matter of fact, at times, the model selection process is referred to as hyper-parameters tuning. In hold-out method for model selection, the dataset is split into three different sets – training, validation and test dataset.
The following process represents hold-out method for model selection:
Split the dataset in three parts – Training dataset, validation dataset and test dataset.
Train different models using different machine learning algorithms. For example, train the classification model using logistic regression, random forest, XGBoost.
For the models trained with different algorithms, tune the hyper-parameters and come up with different models. For each of the algorithms mentioned in step 2, change hyper parameters settings and come with multiple models.
Test the performance of each of these models (belonging to each of the algorithms) on the validation dataset.
Select the most optimal model out of models tested on the validation dataset. The most optimal model will have the most optimal hyper parameters settings for specific algorithm. Going by the above example, lets say the model trained with XGBoost with most optimal hyper parameters gets selected.
Test the performance of the most optimal model on the test dataset.
The above can be understood using the following diagram. Note the three different split of the original dataset. The process of training, tuning and evaluation is repeated multiple times and the most optimal model is selected. The final model is evaluated on the test dataset.
You can see how leave one out is done from the link below.
You can see how bootstrap is done from the link below.
You can see how bootstrap is done from the link below.
And You can watch it in detail on youtube from the link below.
(4. AND 5. WERE PASSED SPEEDLY BECAUSE HAVING LESS IMPORTANCE THAN OTHERS)
Hyperparameters are hugely important in getting good performance with models. In order to understand this process, we first need to understand the difference between a model parameter and a model hyperparameter.
Model parameters are internal to the model whose values can be estimated from the data and we are often trying to estimate them as best as possible . whereas hyperparameters are external to our model and cannot be directly learned from the regular training process. These parameters express “higher-level” properties of the model such as its complexity or how fast it should learn. Hyperparameters are model-specific properties that are ‘fixed’ before you even train and test your model on data.
The process for finding the right hyperparameters is still somewhat of a dark art, and it currently involves either random search or grid search across Cartesian products of sets of hyperparameters.
There are bunch of methods available for tuning of hyperparameters. In this blog post, I chose to demonstrate using two popular methods. first one is grid search and the second one is Random Search.
GridSearch takes a dictionary of all of the different hyperparameters that you want to test, and then feeds all of the different combinations through the algorithm for you and then reports back to you which one had the highest accuracy.
Using grid search, even though there are more hyperparameters let’s us tune the ‘C value’ also known as the ‘regularization strength’ of our logistic regression as well as ‘penalty’ of our logistic regression algorithm.
First, let us create logistic regression object and assign different values over which we need to test.
You can learn the detailed explanation and codes from the scikitlearn library from the link below.
It is performed by evaluating n uniformly random points in the hyperparameter space, and select the one producing the best performance.
Now, we instantiate the random search and fit it like any Scikit-Learn model:
These values are close to the values obtained with grid search
Tuning and finding the right hyperparameters for your model is an optimization problem. We want to minimize the loss function of our model by changing model parameters. Bayesian optimization helps us find the minimal point in the minimum number of steps. Bayesian optimization also uses an acquisition function that directs sampling to areas where an improvement over the current best observation is likely.
You can look the codes and details from the link below.
A common job of machine learning algorithms is to recognize objects and being able to separate them into categories. This process is called classification, and it helps us segregate vast quantities of data into discrete values, i.e. :distinct, like 0/1, True/False, or a pre-defined output label class.
In this article titled ‘Everything you need to know about Classification in Machine Learning’, you will learn about classification, and the following topics too:
What is Supervised Learning?
What is Classification?
Classification Models
Before we dive into Classification, let’s take a look at what Supervised Learning is. Suppose you are trying to learn a new concept in maths and after solving a problem, you may refer to the solutions to see if you were right or not. Once you are confident in your ability to solve a particular type of problem, you will stop referring to the answers and solve the questions put before you by yourself.
This is also how Supervised Learning works with machine learning models. In Supervised Learning, the model learns by example. Along with our input variable, we also give our model the corresponding correct labels. While training, the model gets to look at which label corresponds to our data and hence can find patterns between our data and those labels.
Some examples of Supervised Learning include:
It classifies spam Detection by teaching a model of what mail is spam and not spam.
Speech recognition where you teach a machine to recognize your voice.
Object Recognition by showing a machine what an object looks like and having it pick that object from among other objects.
We can further divide Supervised Learning into the following:
Logistic Regression is a Machine Learning algorithm which is used for the classification problems, it is a predictive analysis algorithm and based on the concept of probability.
We can call a Logistic Regression a Linear Regression model but the Logistic Regression uses a more complex cost function, this cost function can be defined as the ‘Sigmoid function’ or also known as the ‘logistic function’ instead of a linear function.
The hypothesis of logistic regression tends it to limit the cost function between 0 and 1. Therefore linear functions fail to represent it as it can have a value greater than 1 or less than 0 which is not possible as per the hypothesis of logistic regression.
What is the Sigmoid Function?
In order to map predicted values to probabilities, we use the Sigmoid function. The function maps any real value into another value between 0 and 1. In machine learning, we use sigmoid to map predictions to probabilities.
Binary (Pass/Fail)
Multi (Cats, Dogs, Sheep)
Ordinal (Low, Medium, High)
You can learn codes from the link of scikit learn library below
It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
What are the Pros and Cons of Naive Bayes?
Pros:
It is easy and fast to predict class of test data set. It also perform well in multi class prediction
When assumption of independence holds, a Naive Bayes classifier performs better compare to other models like logistic regression and you need less training data.
It perform well in case of categorical input variables compared to numerical variable(s). For numerical variable, normal distribution is assumed (bell curve, which is a strong assumption).
Cons:
If categorical variable has a category (in test data set), which was not observed in training data set, then model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
On the other side naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.
4 Applications of Naive Bayes Algorithms
Real time Prediction: Naive Bayes is an eager learning classifier and it is sure fast. Thus, it could be used for making predictions in real time.
Multi class Prediction: This algorithm is also well known for multi class prediction feature. Here we can predict the probability of multiple classes of target variable.
Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers mostly used in text classification (due to better result in multi class problems and independence rule) have higher success rate as compared to other algorithms. As a result, it is widely used in Spam filtering (identify spam e-mail) and Sentiment Analysis (in social media analysis, to identify positive and negative customer sentiments)
Recommendation System: Naive Bayes Classifier and Collaborative Filtering together builds a Recommendation System that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not
How to build a basic model using Naive Bayes in Python and R?
Again, scikit learn (python library) will help here to build a Naive Bayes model in Python. There are three types of Naive Bayes model under the scikit-learn library:
Gaussian: It is used in classification and it assumes that features follow a normal distribution.
Multinomial: It is used for discrete counts. For example, let’s say, we have a text classification problem. Here we can consider Bernoulli trials which is one step further and instead of “word occurring in the document”, we have “count how often word occurs in the document”, you can think of it as “number of times outcome number x_i is observed over the n trials”.
Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with ‘bag of words’ model where the 1s & 0s are “word occurs in the document” and “word does not occur in the document” respectively.
You can learn codes and more information from the link of scikit learn library below:
This algorithm is one of the more simple techniques used in machine learning. It is a method preferred by many in the industry because of its ease of use and low calculation time.
What is KNN? KNN is a model that classifies data points based on the points that are most similar to it. It uses test data to make an “educated guess” on what an unclassified point should be classified as.
Pros:
Easy to use.
Quick calculation time.
Does not make assumptions about the data.
Cons:
Accuracy depends on the quality of the data.
Must find an optimal k value (number of nearest neighbors).
Poor at classifying data points in a boundary where they can be classified one way or another.
KNN is an algorithm that is considered both non-parametric and an example of lazy learning. What do these two terms mean exactly?
Non-parametric means that it makes no assumptions. The model is made up entirely from the data given to it rather than assuming its structure is normal.
Lazy learning means that the algorithm makes no generalizations. This means that there is little training involved when using this method. Because of this, all of the training data is also used in testing when using KNN.
KNN is often used in simple recommendation systems, image recognition technology, and decision-making models. It is the algorithm companies like Netflix or Amazon use in order to recommend different movies to watch or books to buy. Netflix even launched the Netflix Prize competition, awarding $1 million to the team that created the most accurate recommendation algorithm!
You might be wondering, “But how do these companies do this?” Well, these companies will apply KNN on a data set gathered about the movies you’ve watched or the books you’ve bought on their website. These companies will then input your available customer data and compare that to other customers who have watched similar movies or bought similar books. This data point will then be classified as a certain profile based on their past using KNN. The movies and books recommended will then depend on how the algorithm classifies that data point.
The image above visualizes how KNN works when trying to classify a data point based a given data set. It is compared to its nearest points and classified based on which points it is closest and most similar to. Here you can see the point Xj will be classified as either W1 (red) or W3 (green) based on its distance from each group of points.
You can learn codes and more from the link below.
Generally, Support Vector Machines is considered to be a classification approach, it but can be employed in both types of classification and regression problems. It can easily handle multiple continuous and categorical variables. SVM constructs a hyperplane in multidimensional space to separate different classes. SVM generates optimal hyperplane in an iterative manner, which is used to minimize an error. The core idea of SVM is to find a maximum marginal hyperplane(MMH) that best divides the dataset into classes.
Support Vectors
Support vectors are the data points, which are closest to the hyperplane. These points will define the separating line better by calculating margins. These points are more relevant to the construction of the classifier.
Hyperplane
A hyperplane is a decision plane which separates between a set of objects having different class memberships.
Margin
A margin is a gap between the two lines on the closest class points. This is calculated as the perpendicular distance from the line to support vectors or closest points. If the margin is larger in between the classes, then it is considered a good margin, a smaller margin is a bad margin.
The main objective is to segregate the given dataset in the best possible way. The distance between the either nearest points is known as the margin. The objective is to select a hyperplane with the maximum possible margin between support vectors in the given dataset. SVM searches for the maximum marginal hyperplane in the following steps:
Generate hyperplanes which segregates the classes in the best way. Left-hand side figure showing three hyperplanes black, blue and orange. Here, the blue and orange have higher classification error, but the black is separating the two classes correctly.
Select the right hyperplane with the maximum segregation from the either nearest data points as shown in the right-hand side figure.
Dealing with non-linear and inseparable planes
Some problems can’t be solved using linear hyperplane, as shown in the figure below (left-hand side).
In such situation, SVM uses a kernel trick to transform the input space to a higher dimensional space as shown on the right. The data points are plotted on the x-axis and z-axis (Z is the squared sum of both x and y: z=x^2=y^2). Now you can easily segregate these points using linear separation.
SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes algorithm. They also use less memory because they use a subset of training points in the decision phase. SVM works well with a clear margin of separation and with high dimensional space.
SVM is not suitable for large datasets because of its high training time and it also takes more time in training compared to Naïve Bayes. It works poorly with overlapping classes and is also sensitive to the type of kernel used.