Week 28
Contents
Decision Tree
Random Forest
Gradient Boosting Machine
XGBoost
LightGBM
CatBoost
Model Evaluation Metrics
Accuracy
Precision
Recall
F1 Score
Log Loss
Area Under the Curve (AUC)
Dimension Reduction
PCA
LDA
Decision Tree
A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits.
A decision tree has the following constituents:
• Root Node: The topmost node, from which the first split is made (for example, a factor such as ‘temperature’).
• Internal Node: The nodes with one incoming edge and 2 or more outgoing edges.
• Leaf Node: This is the terminal node with no outgoing edge.
A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a tree-like shape.
We are choosing our decisions to be quite “high-level” in order to keep the tree small. The decisions will be selected such that the tree is as small as possible while aiming for high classification / regression accuracy.
THE GREEDY APPROACH
The greedy approach is based on the concept of heuristic problem solving: make the locally optimal choice at each node. By chaining these locally optimal choices, we reach an approximately optimal solution globally.
The algorithm can be summarized as:
1. At each stage (node), pick out the best feature as the test condition.
2. Now split the node into the possible outcomes (internal nodes).
3. Repeat the above steps until all the test conditions have been exhausted and only leaf nodes remain.
When you start to implement the algorithm, the first question is: ‘How to pick the starting test condition?’
The answer to this question lies in the values of ‘Entropy’ and ‘Information Gain’. Let us see what they are and how they impact the creation of our decision tree.
How to Select the Best Feature to Split
INFORMATION GAIN
GINI INDEX
Here is an example of the decision tree algorithm applied in practice that you might want to check:
An attribute should have the highest information gain to be selected for splitting. Based on the computed values of Entropy and Information Gain, we choose the best attribute at any particular step.
Many different decision trees can be formulated from a given set of attributes.
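To make this concrete, here is a minimal sketch (not from the original notes) computing Shannon entropy and the information gain of a candidate split for a binary target, using NumPy:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, left_labels, right_labels):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent_labels)
    weighted_child_entropy = (
        len(left_labels) / n * entropy(left_labels)
        + len(right_labels) / n * entropy(right_labels)
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy example: a split that separates the two classes perfectly
parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left   = np.array([1, 1, 1, 1])   # observations where the test condition is true
right  = np.array([0, 0, 0, 0])   # observations where it is false
print(information_gain(parent, left, right))  # 1.0 for this perfect split
```

The attribute whose split yields the highest information gain would be chosen as the test condition at that node.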
DISADVANTAGES
Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared.
RANDOM FOREST
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble.
THE WISDOM OF CROWDS
A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction).
While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.
Bagging (Bootstrap Aggregation) — Decisions trees are very sensitive to the data they are trained on — small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. This process is known as bagging.
Node splitting in a random forest model is based on a random subset of features for each tree.
Feature Randomness — In a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node vs. those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.
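To make the two ideas concrete, here is a minimal sketch with scikit-learn's RandomForestClassifier on synthetic data; the bootstrap and max_features arguments correspond to bagging and feature randomness respectively, and the parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees voting in the ensemble
    max_features="sqrt",   # feature randomness: each split sees a random subset of features
    bootstrap=True,        # bagging: each tree trains on a bootstrap sample of the rows
    random_state=42,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```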
GRADIENT BOOSTING MACHINE (GBM)
1 - XGBOOST
XGBoost stands for eXtreme Gradient Boosting.
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance.
The library is laser-focused on computational speed and model performance; as such, there are few frills. Nevertheless, it does offer several advanced features.
Model Features
The implementation of the model supports the features of the scikit-learn and R implementations, with new additions like regularization. Three main forms of gradient boosting are supported:
Gradient Boosting algorithm also called gradient boosting machine including the learning rate.
Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels.
Regularised Gradient Boosting with both L1 and L2 regularization.
System Features
The library provides a system for use in a range of computing environments, including:
Parallelization of tree construction using all of your CPU cores during training.
Distributed Computing for training very large models using a cluster of machines.
Out-of-Core Computing for very large datasets that don’t fit into memory.
Cache Optimization of data structures and algorithm to make best use of hardware.
Algorithm Features
The implementation of the algorithm was engineered for efficiency of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:
Sparse Aware implementation with automatic handling of missing data values.
Block Structure to support the parallelization of tree construction.
Continued Training so that you can further boost an already fitted model on new data.
XGBoost is free open source software available for use under the permissive Apache-2 license.
Why use XGBoost ?
The two reasons to use XGBoost are also the two goals of the project:
Execution Speed
Model Performance
Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting.
Szilard Pafka performed some objective benchmarks comparing the performance of XGBoost to other implementations of gradient boosting and bagged decision trees. He wrote up his results in May 2015 in the blog post titled “Benchmarking Random Forest Implementations“.
He also provides all the code on GitHub and a more extensive report of results with hard numbers.
Model Performance
XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.
What algorithm does XGBoost use?
The XGBoost library implements the gradient boosting decision tree algorithm.
This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.
Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
This approach supports both regression and classification predictive modeling problems.
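As a concrete illustration, here is a minimal sketch using the xgboost package's scikit-learn wrapper on synthetic data; the parameter values are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=300,      # number of boosting rounds (trees added sequentially)
    learning_rate=0.1,     # shrinkage applied to each new tree
    max_depth=4,           # depth of the individual trees
    subsample=0.8,         # stochastic gradient boosting: row sub-sampling
    colsample_bytree=0.8,  # column sub-sampling
    reg_lambda=1.0,        # L2 regularisation (regularised gradient boosting)
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```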
Light GBM
What is Light GBM?
Light GBM is a gradient boosting framework that uses tree based learning algorithm.
How it differs from other tree based algorithms?
Light GBM grows trees vertically while other algorithms grow trees horizontally, meaning that Light GBM grows trees leaf-wise while other algorithms grow level-wise. It will choose the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.
Below diagrams explain the implementation of LightGBM and other boosting algorithms.
Why Light GBM is gaining extreme popularity?
The size of data is increasing day by day, and it is becoming difficult for traditional data science algorithms to give fast results. Light GBM is prefixed with ‘Light’ because of its high speed. Light GBM can handle large datasets and takes less memory to run. Another reason why Light GBM is popular is that it focuses on the accuracy of results. LGBM also supports GPU learning, and thus data scientists are widely using LGBM for data science application development.
Can we use Light GBM everywhere?
No, it is not advisable to use LGBM on small datasets. Light GBM is sensitive to overfitting and can easily overfit small data. There is no hard threshold on the number of rows, but based on experience, practitioners suggest that it should be used only for data with 10,000+ rows.
It is very important to get familiar with basic parameters of an algorithm that you are using. LightGBM has more than 100 parameters that are given in the documentation of LightGBM, but there is no need to study all of them. You can check important parameters from the link below.
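For illustration, here is a minimal sketch with the lightgbm scikit-learn wrapper on synthetic data; num_leaves is the main knob controlling leaf-wise growth, and the other values are illustrative assumptions:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,         # key knob for leaf-wise growth; larger values give more complex trees
    max_depth=-1,          # -1 means no depth limit
    min_child_samples=20,  # helps guard against overfitting on small leaves
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```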
Here is a useful link for proper understanding of light gbm:
Cat Boost
CatBoost is the first Russian machine learning algorithm developed to be open source. The algorithm was developed in the year 2017 by machine learning researchers and engineers at Yandex.
The intention is to serve multi-functional purposes such as
· Recommendation systems,
· Personal assistants,
· Self-driving cars,
· Weather prediction, and many other tasks.
The CatBoost algorithm is another member of the family of gradient boosting techniques on decision trees. One of the many unique features that the CatBoost algorithm offers is its ability to work with diverse data types, helping to solve a wide range of data problems faced by numerous businesses.
Not just that: CatBoost also offers accuracy comparable to the other algorithms in the tree family. The name CatBoost is a contraction of “Category” and “Boosting.” Does the “Category” in CatBoost mean it only works with categorical features?
The answer is, “No.”
According to the CatBoost documentation, CatBoost supports numerical, categorical, and text features, but it has particularly good handling of categorical data. The CatBoost algorithm has quite a number of parameters for tuning the features in the processing stage. "Boosting" in CatBoost refers to gradient boosting. Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model as an ensemble of weak prediction models, typically decision trees.
Gradient boosting is a robust machine learning algorithm that performs well when used to provide solutions to different types of business problems such as
· Fraud detection,
· Recommendation system,
· Forecasting.
Again, it can return outstanding results with relatively little data, unlike many other machine learning algorithms that only perform well after learning from extensive data.
Easy Implementation
CatBoost offers easy-to-use interfaces. The CatBoost algorithm can be used in Python with scikit-learn, in R, and from the command line. It also provides a fast and scalable GPU version: the researchers and machine learning engineers at Yandex designed CatBoost to work on datasets as large as tens of thousands of objects without lagging.
Training your model on GPU gives a better speedup when compared to training the model on CPU. To crown this improvement, the larger the dataset is, the more significant the speedup. CatBoost efficiently supports multi-card configuration. So, for large datasets, use a multi-card configuration.
Faster Training & Predictions
Before improvements in servers, the maximum number of GPUs per server was 8. Some datasets are more extensive than that, but CatBoost uses distributed GPUs. This feature enables CatBoost to learn faster and make predictions 13-16 times faster than other algorithms.
PIPELINE
1. Import the libraries/modules needed
2. Import data
3. Data cleaning and preprocessing
4. Train-test split
5. CatBoost training and prediction
6. Model Evaluation
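A minimal sketch of this pipeline with the catboost package; the file name data.csv and the column name target are hypothetical placeholders, and the preprocessing step is only a stub:

```python
# 1. Import the libraries/modules needed
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 2. Import data (hypothetical file and column names)
df = pd.read_csv("data.csv")

# 3. Data cleaning and preprocessing (placeholder: drop rows with a missing target)
df = df.dropna(subset=["target"])
X = df.drop(columns=["target"])
y = df["target"]
cat_features = X.select_dtypes(include="object").columns.tolist()

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. CatBoost training and prediction (categorical columns passed as-is)
model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=0)
model.fit(X_train, y_train, cat_features=cat_features)
y_pred = model.predict(X_test)

# 6. Model evaluation
print("accuracy:", accuracy_score(y_test, y_pred))
```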
Model Evaluation Metrics
Once you have built your model, the most important question that arises is: how good is your model? Evaluating your model is therefore one of the most important tasks in a data science project, as it delineates how good your predictions are.
The following figure shows the results of the model that have been built for one of the projects.
A two-class boosted decision tree algorithm has been used, and the goal is to predict the survival of the passengers on the Titanic.
Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. All the measures except AUC can be calculated by using four parameters. So, let’s talk about those four parameters first.
True positives and true negatives are the observations that are correctly predicted and are therefore shown in green. We want to minimize false positives and false negatives, so they are shown in red. These terms can be a bit confusing, so let's take each one in turn and understand it fully.
True Positives (TP) - These are the correctly predicted positive values which means that the value of actual class is yes and the value of predicted class is also yes. E.g. if actual class value indicates that this passenger survived and predicted class tells you the same thing.
True Negatives (TN) - These are the correctly predicted negative values which means that the value of actual class is no and value of predicted class is also no. E.g. if actual class says this passenger did not survive and predicted class tells you the same thing.
False positives and false negatives, these values occur when your actual class contradicts with the predicted class.
False Positives (FP) – When actual class is no and predicted class is yes. E.g. if actual class says this passenger did not survive but predicted class tells you that this passenger will survive.
False Negatives (FN) – When the actual class is yes but the predicted class is no. E.g. the actual class value indicates that this passenger survived, but the predicted class tells you that the passenger will die.
Once you understand these four parameters then we can calculate Accuracy, Precision, Recall and F1 score.
Then how do we interpret the other metrics?
Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations. One might think that if we have high accuracy, then our model is the best. Accuracy is a great measure, but only when you have symmetric datasets where the counts of false positives and false negatives are almost the same. Therefore, you have to look at other parameters to evaluate the performance of your model. For our model, we have got 0.803, which means our model is approximately 80% accurate.
Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all passengers labeled as survived, how many actually survived? High precision relates to a low false positive rate. We have got a precision of 0.788, which is pretty good.
Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations in the actual class (yes). The question recall answers is: of all the passengers that truly survived, how many did we label as survived? We have got a recall of 0.631, which is good for this model as it's above 0.5.
F1 score - The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall. In our case, the F1 score is 0.701.
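As a concrete illustration, here is a minimal sketch computing these four metrics with scikit-learn on made-up labels (1 = survived, 0 = did not survive); the numbers are illustrative, not the Titanic results quoted above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# sklearn's 2x2 confusion matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN)
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```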
AREA UNDER THE ROC CURVE (AUC - ROC)
This is again one of the popular metrics used in the industry. The biggest advantage of using the ROC curve is that it is independent of changes in the proportion of responders. This statement will become clearer in the following sections.
Let’s first try to understand what the ROC (Receiver Operating Characteristic) curve is. If we look at the confusion matrix, we observe that for a probabilistic model we get a different value for each metric depending on the threshold used to convert probabilities into class labels.
Hence, for each sensitivity, we get a different specificity. The two vary as follows:
The ROC curve is the plot between sensitivity and (1- specificity). (1- specificity) is also known as false positive rate and sensitivity is also known as True Positive rate. Following is the ROC curve for the case in hand.
Let’s take an example of threshold = 0.5 (refer to the confusion matrix). Here is the confusion matrix:
As you can see, the sensitivity at this threshold is 99.6% and the (1-specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under this curve (AUC).
Note that the area of the entire square is 1*1 = 1. Hence AUC itself is the ratio of the area under the curve to the total area. For the case in hand, we get an AUC ROC of 96.4%. Following are a few thumb rules:
.90-1 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)
We see that we fall in the excellent band for the current model. But this might simply be over-fitting. In such cases it becomes very important to do in-time and out-of-time validations.
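As a quick illustration of how the curve and the area under it are computed in practice, here is a minimal sketch with scikit-learn, using made-up labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.45])

# Sensitivity (TPR) and 1 - specificity (FPR) at every candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("FPR:", fpr)
print("TPR:", tpr)

# Single-number summary: area under the ROC curve
print("AUC:", roc_auc_score(y_true, y_prob))
```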
POINTS TO REMEMBER
A model which gives a class as output will be represented as a single point in the ROC plot.
Such models cannot be compared with each other, as the judgement needs to be made on a single metric rather than multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) can come out of the same underlying model, so these metrics should not be compared directly.
In the case of a probabilistic model, we were fortunate enough to get a single number, the AUC-ROC. But we still need to look at the entire curve to make conclusive decisions. It is also possible that one model performs better in some region of the curve and another performs better in a different region.
LOG LOSS
AUC ROC considers the predicted probabilities when determining our model’s performance. However, there is an issue with AUC ROC: it only takes into account the order of the probabilities, and hence it does not take into account the model’s capability to predict a higher probability for samples that are more likely to be positive. In that case, we can use log loss, which is nothing but the negative average of the log of the corrected predicted probabilities for each instance.
For N instances, Log Loss = -(1/N) * Σ [ yi * log(p(yi)) + (1 - yi) * log(1 - p(yi)) ], where:
p(yi) is the predicted probability of the positive class,
1 - p(yi) is the predicted probability of the negative class,
yi = 1 for the positive class and 0 for the negative class (actual values).
Let us calculate log loss for a few random values to get the gist of the above mathematical function:
Logloss(1, 0.1) = 2.303
Logloss(1, 0.5) = 0.693
Logloss(1, 0.9) = 0.105
If we plot this relationship, we will get a curve as follows:
It’s apparent from the gentle downward slope towards the right that the Log Loss gradually declines as the predicted probability improves. Moving in the opposite direction though, the Log Loss ramps up very rapidly as the predicted probability approaches 0.
So, the lower the log loss, the better the model. However, there is no absolute threshold for a good log loss; it is use-case/application dependent.
Whereas the AUC is computed with regard to binary classification with a varying decision threshold, log loss actually takes the "certainty" of the classification into account.
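Here is a minimal sketch reproducing the example values above with NumPy; sklearn.metrics.log_loss computes the same quantity averaged over all instances:

```python
import numpy as np

def log_loss_single(y, p):
    """Log loss of one prediction: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Reproduce the example values from the text (actual class = 1)
for p in (0.1, 0.5, 0.9):
    print(f"Logloss(1, {p}) = {log_loss_single(1, p):.3f}")
# Logloss(1, 0.1) = 2.303, Logloss(1, 0.5) = 0.693, Logloss(1, 0.9) = 0.105
```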
Dimensionality Reduction
The number of input variables or features for a dataset is referred to as its dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of dimensionality.
High-dimensionality statistics and dimensionality reduction techniques are often used for data visualisation. Nevertheless these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.
Problem with many input variables
The performance of machine learning algorithms can degrade with too many input variables.
If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features.
We can consider the columns of data representing dimensions on an n-dimensional feature space and the rows of data as points in that space. This is a useful geometric interpretation of a dataset.
Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.
This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.”
Therefore, it is often desirable to reduce the number of input features.
This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
Techniques for Dimensionality Reduction
Feature Selection Methods
Perhaps the most common are so-called feature selection techniques that use scoring or statistical methods to select which features to keep and which features to delete.
Two main classes of feature selection techniques include wrapper methods and filter methods.
Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating the model with different subsets of input features and selecting the subset that results in the best model performance. RFE (Recursive Feature Elimination) is an example of a wrapper feature selection method.
Filter methods use scoring methods, like correlation between the feature and the target variable, to select a subset of input features that are most predictive. Examples include Pearson’s correlation and Chi-Squared test.
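As an illustration of the two classes of methods, here is a minimal sketch with scikit-learn; it uses the ANOVA F-test as the filter score (rather than the chi-squared test mentioned above, which requires non-negative features) and logistic regression inside RFE. Both choices are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

# Filter method: score each feature against the target and keep the best k
filter_selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = filter_selector.fit_transform(X, y)

# Wrapper method: recursively fit a model and drop the weakest features (RFE)
wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapped = wrapper_selector.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)  # both reduced to 5 columns
```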
Feature Projection
Feature Projection — also called Feature Extraction or Feature Transformation — extracts "important" features by applying transformations to the original features. Depending on the type of the applied transformation, feature transformation approaches are further divided into Linear and Non-Linear:
Linear methods linearly combine the original features to compress the original dataset into fewer dimensions. Common methods include the Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Singular Value Decomposition (SVD).
Non-Linear methods are more complex but can find useful reductions of the dimensions where linear methods fail; reducing data that lies on a curved manifold down to just a rotation and scale, for instance, is not possible for a linear method. Non-linear dimensionality reduction methods include kernel PCA, t-SNE, Autoencoders, Self-Organizing Maps, IsoMap, and UMAP.
For the sake of simplicity, we will focus on the PCA and LDA methods.
Principal Component Analysis (PCA)
PCA is one of the simplest and by far the most common method for dimensionality reduction. It can be thought of as a lossy compression method that linearly combines dimensions to reduce them, keeping as much of the dataset variance as possible. It reduces a dataset into its Principal Components. Let's say that the Principal Components are the orthogonal directions along which the variance in the data is the greatest.
Note that features need to be scaled to all have the same variance (unit variance of 1) before using PCA. Otherwise, PCA will overestimate the importance of variables with large magnitudes.
Advantages of PCA
There are two main advantages of dimensionality reduction with PCA.
The training time of the algorithms reduces significantly with a smaller number of features.
It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of pairwise scatter plots required to visualize the data would be 100(100 - 1)/2 = 4950. Practically, it is not possible to analyze data this way.
Applying PCA
It is only a matter of three lines of code to perform PCA using Python's Scikit-Learn library. The PCA class is used for this purpose. PCA depends only upon the feature set and not the label data. Therefore, PCA can be considered an unsupervised machine learning technique.
Performing PCA using Scikit-Learn is a two-step process:
1. Initialize the PCA class by passing the number of components to the constructor.
2. Call the fit and then transform methods by passing the feature set to these methods. The transform method returns the specified number of principal components.
Take a look at the following code:
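The snippet itself is not included in these notes; here is a minimal sketch of what it could look like, assuming the Iris data (consistent with the four features and variance figures quoted below), split and standardized beforehand:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed setup: the Iris data, split and standardized (as the scaling note above recommends)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# No n_components argument: all principal components are kept for now
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
```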
In the code above, we create a PCA object named pca. We did not specify the number of components in the constructor, so all four features in the feature set will be returned for both the training and test sets.
The PCA class contains explained_variance_ratio_, which returns the variance explained by each of the principal components. Execute the following line of code to find the explained variance ratio; the resulting explained_variance variable is a float array containing the variance ratio of each principal component.
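A sketch of that line, assuming the pca object from the snippet above:

```python
explained_variance = pca.explained_variance_ratio_  # variance explained by each component
print(explained_variance)
```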
It can be seen that the first principal component is responsible for 72.22% of the variance, and the second principal component for 23.9%. Collectively, we can say that 96.12% (72.22 + 23.9) of the classification information contained in the feature set is captured by the first two principal components.
You can try using a single principal component to train the algorithm. To do so, execute the following code:
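Again, the original snippet is not reproduced here; a plausible sketch, reusing the standardized X_train and X_test from the earlier PCA example:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=1)            # keep only the first principal component
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
```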
The rest of the process is training, making predictions and evaluating the performance.
LDA - Linear Discriminant Analysis
Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It can also be used as a dimensionality reduction technique, providing a projection of a training dataset that best separates the examples by their assigned class.
The ability to use Linear Discriminant Analysis for dimensionality reduction often surprises most practitioners. It should not be confused with “Latent Dirichlet Allocation” (LDA), which is also a dimensionality reduction technique for text documents.
Linear Discriminant Analysis seeks to best separate (or discriminate) the samples in the training dataset by their class value. Specifically, the model seeks to find a linear combination of input variables that achieves the maximum separation for samples between classes (class centroids or means) and the minimum separation of samples within each class.
There are many ways to frame and solve LDA; for example, it is common to describe the LDA algorithm in terms of Bayes Theorem and conditional probabilities.
In practice, LDA for multi-class classification is typically implemented using the tools of linear algebra and, like PCA, uses matrix factorization at the core of the technique. As such, it is good practice to standardize the data prior to fitting an LDA model.
Performing LDA
It requires only four lines of code to perform LDA with Scikit-Learn. The LinearDiscriminantAnalysis class of the sklearn.discriminant_analysis library can be used to perform LDA in Python. Take a look at the following script:
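The script itself is not reproduced in these notes; here is a minimal sketch consistent with the description that follows, assuming X_train, X_test, and y_train already exist (e.g., from the PCA example above):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=1)                      # keep a single linear discriminant
X_train = lda.fit_transform(X_train, y_train)  # fitting uses the class labels
X_test = lda.transform(X_test)
```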
In the script above the LinearDiscriminantAnalysis class is imported as LDA. Like PCA, we have to pass a value for the n_components parameter of the LDA, which refers to the number of linear discriminants that we want to retrieve. In this case we set n_components to 1, since we first want to check the performance of our classifier with a single linear discriminant. Finally we execute the fit and transform methods to actually retrieve the linear discriminants.
Notice that, in the case of LDA, fitting takes two arguments, X_train and y_train (passed here via fit_transform), whereas in the case of PCA only X_train is required. This reflects the fact that LDA takes the output class labels into account while selecting the linear discriminants, while PCA doesn't depend upon the output labels.
PCA vs LDA: What's the Difference?
Both PCA and LDA are linear transformation techniques. However, PCA is an unsupervised technique, while LDA is a supervised dimensionality reduction technique.
PCA has no concern with the class labels. In simple words, PCA summarizes the feature set without relying on the output. PCA tries to find the directions of maximum variance in the dataset. In a large feature set, there are many features that are merely duplicates of other features or are highly correlated with other features. Such features are basically redundant and can be ignored. The role of PCA is to find such highly correlated or duplicate features and to come up with a new feature set with minimum correlation between the features, or in other words a feature set with maximum variance between the features. Since the variance between the features doesn't depend upon the output, PCA doesn't take the output labels into account.
Unlike PCA, LDA tries to reduce dimensions of the feature set while retaining the information that discriminates output classes. LDA tries to find a decision boundary around each cluster of a class. It then projects the data points to new dimensions in a way that the clusters are as separate from each other as possible and the individual elements within a cluster are as close to the centroid of the cluster as possible. The new dimensions are ranked on the basis of their ability to maximize the distance between the clusters and minimize the distance between the data points within a cluster and their centroids. These new dimensions form the linear discriminants of the feature set.
Comparison of LDA and PCA 2D projection of Iris dataset
The Iris dataset represents 3 kinds of Iris flowers (Setosa, Versicolour and Virginica) with 4 attributes: sepal length, sepal width, petal length and petal width.
Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the first 2 principal components.
Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
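A minimal sketch of this comparison (along the lines of the scikit-learn example), projecting the Iris data onto two components with each method and plotting them side by side:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised: directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: directions that best separate the three classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
ax1.set_title("PCA of Iris dataset")
ax2.scatter(X_lda[:, 0], X_lda[:, 1], c=y)
ax2.set_title("LDA of Iris dataset")
plt.show()
```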
PCA vs LDA: What to Choose for Dimensionality Reduction?
In the case of uniformly distributed data, LDA almost always performs better than PCA. However, if the data is highly skewed (irregularly distributed), then it is advised to use PCA, since LDA can be biased towards the majority class.
Finally, it is beneficial that PCA can be applied to labeled as well as unlabeled data since it doesn't rely on the output labels. On the other hand, LDA requires output classes for finding linear discriminants and hence requires labeled data.