The number of input variables or features for a dataset is referred to as its dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
More input features often make a predictive modeling task more challenging, a difficulty more generally referred to as the curse of dimensionality.
High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.
The performance of machine learning algorithms can degrade with too many input variables.
If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features.
We can consider the columns of data representing dimensions on an n-dimensional feature space and the rows of data as points in that space. This is a useful geometric interpretation of a dataset.
Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.
This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.”
Therefore, it is often desirable to reduce the number of input features.
This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
Perhaps the most common dimensionality reduction techniques are so-called feature selection techniques that use scoring or statistical methods to select which features to keep and which features to delete.
Two main classes of feature selection techniques include wrapper methods and filter methods.
Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating the model with different subsets of input features and selecting the subset that results in the best model performance. Recursive Feature Elimination (RFE) is an example of a wrapper feature selection method.
Filter methods use scoring methods, like correlation between the feature and the target variable, to select a subset of input features that are most predictive. Examples include Pearson’s correlation and Chi-Squared test.
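For illustration, here is a minimal scikit-learn sketch of both families; the Iris data and the logistic regression estimator are assumptions chosen just for this example:

```python
# Hypothetical example: filter vs. wrapper feature selection with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-squared score.
X_filtered = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Wrapper method: Recursive Feature Elimination around a logistic regression model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)  # both reduced to (150, 2)
```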
Feature Projection — also called Feature Extraction or Feature Transformation — extracts "important" features by applying transformations to the original features. Depending on the type of the applied transformation, feature transformation approaches are further divided into Linear and Non-Linear:
Linear methods linearly combine the original features to compress the original dataset into fewer dimensions. Common methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Singular Value Decomposition (SVD).
Non-linear methods are more complex but can find useful reductions of the dimensions where linear methods fail. Reducing the dimensionality to just rotation and scale, as in Figure 1, would not be possible for a linear method. Non-linear dimensionality reduction methods include kernel PCA, t-SNE, autoencoders, Self-Organizing Maps, Isomap, and UMAP.
For the sake of simplicity, we will focus on the PCA and LDA methods.
PCA is one of the simplest and by far the most common method for dimensionality reduction. It can be thought of as a lossy compression method that linearly combines dimensions to reduce them, keeping as much of the dataset variance as possible. It reduces a dataset into its _Principal Components_: the orthogonal directions along which the variance in the data is the greatest.
Note that features need to be scaled to all have the same variance (unit variance of 1) before using PCA. Otherwise, PCA will overestimate the importance of variables with large magnitude.
Advantages of PCA
There are two main advantages of dimensionality reduction with PCA.
The training time of the algorithms decreases significantly with a smaller number of features.
It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of scatter plots required to visualize the pairwise relationships would be 100(100-1)/2 = 4950. Practically, it is not possible to analyze data this way.
Applying PCA
It is only a matter of three lines of code to perform PCA using Python's Scikit-Learn library. The `PCA` class is used for this purpose. PCA depends only upon the feature set and not the label data. Therefore, PCA can be considered an unsupervised machine learning technique.
Performing PCA using Scikit-Learn is a two-step process:
1. Initialize the `PCA` class by passing the number of components to the constructor.
2. Call the `fit` and then `transform` methods, passing the feature set to these methods. The `transform` method returns the specified number of principal components.
Take a look at the following code:
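The original listing is not reproduced here, so the following is a sketch of what it might look like; the Iris data (four features), the train/test split, and the standardization step are assumptions made for this example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed setup: a 4-feature dataset, split and scaled to unit variance.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# The three PCA lines the text refers to:
pca = PCA()                           # no n_components: all components are kept
X_train = pca.fit_transform(X_train)  # fit on the training features only
X_test = pca.transform(X_test)        # apply the same projection to the test set
```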
In the code above, we create a `PCA` object named `pca`. We did not specify the number of components in the constructor. Hence, all four of the features in the feature set will be returned for both the training and test sets.
The `PCA` class contains `explained_variance_ratio_`, which returns the fraction of the variance explained by each of the principal components. Execute the following line of code to find the "explained variance ratio":
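Assuming the `pca` object from the sketch above, that line is simply:

```python
explained_variance = pca.explained_variance_ratio_
```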
The `explained_variance` variable is now a float-type array which contains the variance ratio for each principal component. The values of the `explained_variance` variable look like this:
0.722265
0.239748
0.0333812
0.0046056
It can be seen that the first principal component is responsible for 72.23% of the variance, and the second principal component for 23.97%. Collectively, we can say that about 96.20% (72.23 + 23.97) of the classification information contained in the feature set is captured by the first two principal components.
You can try using a single principal component to train the algorithm. To do so, execute the following code:
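Continuing the same sketch (again assuming the scaled `X_train` and `X_test` from above):

```python
from sklearn.decomposition import PCA

# Keep only the first principal component this time.
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
```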
The rest of the process is training, making predictions and evaluating the performance.
Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It can also be used as a dimensionality reduction technique, providing a projection of a training dataset that best separates the examples by their assigned class.
The ability to use Linear Discriminant Analysis for dimensionality reduction often surprises most practitioners. It should not be confused with “Latent Dirichlet Allocation” (LDA), which is also a dimensionality reduction technique for text documents.
Linear Discriminant Analysis seeks to best separate (or discriminate) the samples in the training dataset by their class value. Specifically, the model seeks to find a linear combination of input variables that achieves the maximum separation for samples between classes (class centroids or means) and the minimum separation of samples within each class.
There are many ways to frame and solve LDA; for example, it is common to describe the LDA algorithm in terms of Bayes Theorem and conditional probabilities.
In practice, LDA for multi-class classification is typically implemented using the tools from linear algebra, and like PCA, uses matrix factorization at the core of the technique. As such, it is good practice to perhaps standardize the data prior to fitting an LDA model.
Performing LDA
It requires only four lines of code to perform LDA with Scikit-Learn. The `LinearDiscriminantAnalysis` class of the `sklearn.discriminant_analysis` library can be used to perform LDA in Python. Take a look at the following script:
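As before, the original script is not reproduced here; a sketch, assuming the scaled `X_train`, `X_test`, and the matching `y_train` from the earlier split, would be:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# One linear discriminant; y_train is required because LDA is supervised.
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
```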
In the script above, the `LinearDiscriminantAnalysis` class is imported as `LDA`. As with PCA, we have to pass a value for the `n_components` parameter of the LDA, which refers to the number of linear discriminants that we want to retrieve. In this case we set `n_components` to 1, since we first want to check the performance of our classifier with a single linear discriminant. Finally, we execute the `fit_transform` and `transform` methods to actually retrieve the linear discriminants.
Notice that in the case of LDA, the `fit_transform` method takes two parameters, `X_train` and `y_train`, whereas in the case of PCA it requires only one parameter, `X_train`. This reflects the fact that LDA takes the output class labels into account while selecting the linear discriminants, while PCA doesn't depend upon the output labels.
Both PCA and LDA are linear transformation techniques. However, PCA is an unsupervised while LDA is a supervised dimensionality reduction technique.
PCA has no concern with the class labels. In simple words, PCA summarizes the feature set without relying on the output. PCA tries to find the directions of maximum variance in the dataset. In a large feature set, there are often many features that are merely duplicates of other features or are highly correlated with them. Such features are essentially redundant and can be ignored. The role of PCA is to find such highly correlated or duplicate features and to produce a new feature set with minimum correlation between the features, or in other words, a feature set with maximum variance between the features. Since the variance between the features doesn't depend upon the output, PCA doesn't take the output labels into account.
Unlike PCA, LDA tries to reduce dimensions of the feature set while retaining the information that discriminates output classes. LDA tries to find a decision boundary around each cluster of a class. It then projects the data points to new dimensions in a way that the clusters are as separate from each other as possible and the individual elements within a cluster are as close to the centroid of the cluster as possible. The new dimensions are ranked on the basis of their ability to maximize the distance between the clusters and minimize the distance between the data points within a cluster and their centroids. These new dimensions form the linear discriminants of the feature set.
The Iris dataset represents 3 kinds of Iris flowers (Setosa, Versicolour and Virginica) with 4 attributes: sepal length, sepal width, petal length and petal width.
Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the first 2 principal components.
Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
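A sketch of that comparison on the Iris data (the plotting details are illustrative, not part of the original text):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

# Unsupervised: the 2 directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)
# Supervised: the 2 directions of maximum class separation (uses y).
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, X_2d, title in zip(axes, (X_pca, X_lda), ("PCA of Iris", "LDA of Iris")):
    for label, name in enumerate(iris.target_names):
        ax.scatter(X_2d[y == label, 0], X_2d[y == label, 1], label=name)
    ax.set_title(title)
    ax.legend()
plt.show()
```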
In the case of uniformly distributed data, LDA almost always performs better than PCA. However, if the data is highly skewed (irregularly distributed), then it is advised to use PCA, since LDA can be biased towards the majority class.
Finally, it is beneficial that PCA can be applied to labeled as well as unlabeled data since it doesn't rely on the output labels. On the other hand, LDA requires output classes for finding linear discriminants and hence requires labeled data.
You’ve probably already been using neural networks on a daily basis. When you ask your mobile assistant to perform a search for you—say, Google or Siri or Amazon Web—or use a self-driving car, these are all neural network-driven. Computer games also use neural networks on the back end, as part of the game system and how it adjusts to the players, and so do map applications, in processing map images and helping you find the quickest way to get to your destination.
A neural network is a system or hardware that is designed to operate like a human brain.
Neural networks can perform the following tasks:
Translate text
Identify faces
Recognize speech
Read handwritten text
Control robots
And a lot more...
1943 – Neural Nets
• The foundations of neural networks were established by Warren McCulloch and Walter Pitts.
• They drew parallels between the brain and computing machines.
• Hebb's rule: In 1949, Donald Hebb explained synaptic plasticity, the adaptation of brain neurons during the learning process.
1969 – XOR problem
• In 1969, Marvin Minsky and Seymour Papert showed that it was impossible for the perceptron to learn how to solve very simple problems, such as the XOR function.
1974 – Backpropagation
• The backpropagation learning rule (first mentioned under a different name in 1963) wasn't applied in the context of neural networks until 1974.
• It led to a “renaissance” in the field of artificial neural network research.
• Paul Werbos defined a new training process for artificial neural networks using backpropagation of errors.
• It has become one of the most widely used tools in the field of ANNs.
1986 – Multilayer Perceptron
• The first version of this architecture was proposed by Rumelhart et al.
• They proposed this feedforward architecture as a solution to non-linear problems.
• It could be locally or fully connected.
• Application examples: data compression, character recognition, or time series prediction.
A perceptron is a single-layer neural network, while a multi-layer perceptron is what we call a neural network.
A layer is made of perceptrons which are linked to the perceptrons of the previous and the next layers, if such layers exist. Every layer defines its own functionality and therefore serves its own purpose. Neural networks consist of an input layer (takes the initial data), an output layer (returns the overall result of the network), and hidden layers (one or many layers with different sizes, i.e. numbers of perceptrons, and functionality).
The Fully-Connected Layer is the most widely used layer type. Its principles of work are based on the Rosenblatt model and are as follows:
Every single perceptron from the previous layer is linked to every single perceptron of this layer.
Each link has a weighted value (weight).
A bias is added to the results.
The layer's weights are held in a 2D array of size m x n (where m is the number of perceptrons in the previous layer and n is the number of perceptrons in this layer). They are initialized with random values.
The layer's bias is held in a 1D array of size n. It is also initialized with random values.
-------------
In order for the network to be able to learn and produce results, each layer has to implement two functions: forward propagation and backward propagation (backpropagation for short).
Imagine a train travelling between point A (input) and point B (output) which changes direction each time it reaches one of the points. The A-to-B course takes one or more samples from the input layer and carries them through the forward propagation functions of all hidden layers consecutively, until point B is reached (and a result is produced). Backpropagation is basically the same thing, only in the opposite direction: the course takes the data through the backpropagation methods of all layers in reverse order until it reaches point A. What distinguishes the two courses, though, is what happens inside these methods.
Forward propagation is only responsible for running the input through a function and returning the result. No learning, only calculations. Backpropagation is a bit trickier because it is responsible for doing two things:
Update the parameters of the layer in order to improve the accuracy of the forward propagation method.
Implement the derivative of the forward propagation function and return the result.
So how and why does that happen, exactly? The mystery unravels at point B, before the train changes direction and goes through the backpropagation of all the layers. In order to tune our model we need to answer two questions:
How good is the model's result compared with the actual output?
How do we minimize this difference?
The process of answering the first question is known as calculating the error. To do that we use cost functions (synonymous with loss functions).
There are different types of cost functions that perform completely different calculations, but all serve the same purpose: to show our model how far it is from the actual result. Choosing a cost function is strictly tied to the purpose of the model, but in this article we will only use one of the most popular variations, Mean Squared Error (MSE).
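As a NumPy sketch (the function names are illustrative), MSE and its derivative, which will later serve as the initial output error fed into backpropagation, might look like this:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: the average of the squared differences.
    return np.mean(np.power(y_true - y_pred, 2))

def mse_derivative(y_true, y_pred):
    # Derivative of MSE with respect to the predictions.
    return 2 * (y_pred - y_true) / y_true.size
```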
The last thing we need to do here is to show our model how to minimize the error. To do that we need an optimization algorithm (optimizer). Once again, there are many kinds of optimizers, all serving the same purpose, but for the sake of keeping things simple yet meaningful we are going to use the most widely used one, on which many other optimization algorithms are based. Behold the mighty Gradient Descent:
By definition, the gradient is a fancy word for derivative, or the rate of change of a function.
So let's imagine our model is a ball. The surface represents the gradient (derivative) of the error. We want the ball to roll down the surface (descend) as low as possible in order to decrease the altitude (the error). Taking it to the math level: we need to reach a global (or at least a good enough local) minimum.
In order to make the ball move, though, we need to update our parameters at a certain rate, called the learning rate. This is a predefined parameter that we pass to our model before we run it. Such parameters are called hyperparameters and have a huge role in our model's performance.
If we choose a learning rate that is too big, the parameters will change drastically and we might skip the minimum. If our learning rate is too small, on the other hand, it will take too much time, and hence computing power, to reach a satisfying result. That's why tuning this parameter by testing the model with different values of the learning rate is rather important. It is highly recommended to start with a learning rate of 0.1 or 0.01 and tune from there.
Now we need to update the model's parameters layer by layer by passing the appropriate data to the backpropagation methods. The backpropagation takes two parameters: the output error and the learning rate. The output error is calculated either as the derivative of the cost function or as the result of the backpropagation of the previous layer (when looking from point B to point A); as written above, backward propagation should return the derivative of the forward propagation function. By doing this, each layer shows its predecessor its error.
Now that we've got the two parameters needed, the backpropagation should update the layer's weights (if such are present). Since every type of layer is different, it defines its own logic for parameter tuning, as sketched below.
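Here is a rough sketch of what that logic could look like for a fully-connected layer, NumPy-based and using the m x n weight layout described earlier; the class and method names are illustrative, not a specific library API:

```python
import numpy as np

class FullyConnectedLayer:
    def __init__(self, input_size, output_size):
        # Weights: m x n array (m inputs, n outputs); bias: 1 x n array.
        self.weights = np.random.randn(input_size, output_size) * 0.1
        self.bias = np.random.randn(1, output_size) * 0.1

    def forward(self, inputs):
        # output = inputs . weights + bias
        self.inputs = inputs
        return np.dot(inputs, self.weights) + self.bias

    def backward(self, output_error, learning_rate):
        # Error to hand back to the previous layer (derivative w.r.t. the inputs).
        input_error = np.dot(output_error, self.weights.T)
        # Gradient of the error w.r.t. this layer's weights.
        weights_error = np.dot(self.inputs.T, output_error)
        # Gradient descent update of the parameters.
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * np.sum(output_error, axis=0, keepdims=True)
        return input_error
```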
When each layer's backpropagation is complete and our train arrives at point A, it takes the next sample (or set of samples) and starts its course through the hidden layers' forward propagation functions once again, only this time they should perform a bit better. This process continues on and on until training is completed and/or an error minimum has been reached.
The only problem with Fully-Connected Layers, though, is that they are linear. In fact, most layers have completely linear logic. A linear function is a polynomial of degree one. Using only such functions hinders the model's ability to learn complex functional mappings, hence learning is limited. That's why (by convention) it is good to add non-linear functionality after every linear layer using activation layers.
Activation layers are just like any other type of layer except they don’t have weights but use a non-linear function over the input instead.
There are perhaps three activation functions you may want to consider for use in hidden layers; they are:
Rectified Linear Activation (ReLU)
Logistic (Sigmoid)
Hyperbolic Tangent (Tanh)
This is not an exhaustive list of activation functions used for hidden layers, but they are the most commonly used.
ReLU
The rectified linear activation function, or ReLU activation function, is perhaps the most common function used for hidden layers.
It is common because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh. Specifically, it is less susceptible to vanishing gradients that prevent deep models from being trained, although it can suffer from other problems like saturated or “dead” units.
The ReLU function is calculated as follows:
max(0.0, x)
This means that if the input value (x) is negative, then a value of 0.0 is returned; otherwise, the input value is returned.
Sigmoid
The sigmoid activation function is also called the logistic function.
It is the same function used in the logistic regression classification algorithm.
The function takes any real value as input and outputs values in the range 0 to 1. The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.
The sigmoid activation function is calculated as follows:
1.0 / (1.0 + e^-x)
Where e is a mathematical constant, which is the base of the natural logarithm.
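A small NumPy sketch of these activation functions, their derivatives, and the weight-free activation layer described earlier (the names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # max(0.0, x), element-wise

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # 1.0 / (1.0 + e^-x)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

class ActivationLayer:
    # A layer without weights: it applies a non-linear function on the forward
    # pass and multiplies the incoming error by the derivative on the backward pass.
    def __init__(self, activation, activation_derivative):
        self.activation = activation
        self.activation_derivative = activation_derivative

    def forward(self, inputs):
        self.inputs = inputs
        return self.activation(inputs)

    def backward(self, output_error, learning_rate):
        return self.activation_derivative(self.inputs) * output_error
```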
Bias is just like an intercept added in a linear equation. It is an additional parameter in the neural network which is used to adjust the output along with the weighted sum of the inputs to the neuron. Moreover, the bias value allows you to shift the activation function to the right or left.
output = sum (weights * inputs) + bias
The output is calculated by multiplying the inputs with their weights and then passing it through an activation function like the Sigmoid function, etc. Here, bias acts like a constant which helps the model to fit the given data. The steepness of the Sigmoid depends on the weight of the inputs.
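For instance, a single neuron with three inputs (the numbers are purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

weights = np.array([0.4, -0.2, 0.1])
inputs = np.array([1.0, 2.0, 3.0])
bias = 0.5

# output = sum(weights * inputs) + bias, passed through the activation function
output = sigmoid(np.dot(weights, inputs) + bias)
print(output)  # sigmoid(0.8) is roughly 0.69
```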
To put all of the pieces together, our neural network class needs to implement the following methods:
a constructor: here we pass the hyperparameters (learning rate and number of epochs, i.e. the number of times our network will run over the input dataset) and initialize the necessary fields;
an add layer method: takes an instance of a layer; used for model construction; can (and should) be called several times in order to add several layers;
a use cost function method: specifies the cost function to be used when training the model;
a fit method: a standard name for the method that performs the training process; this is where we place the gradient descent logic from earlier;
a predict method: a standard name for the method that is used to calculate results only; it is useful once the training process is complete.
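A minimal skeleton tying these methods together, reusing the layer and cost sketches from above (again, the names are illustrative, not a library API):

```python
class Network:
    def __init__(self, learning_rate=0.01, epochs=100):
        # Hyperparameters plus an (initially empty) list of layers.
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.layers = []
        self.cost = None
        self.cost_derivative = None

    def add_layer(self, layer):
        self.layers.append(layer)

    def use_cost_function(self, cost, cost_derivative):
        self.cost = cost
        self.cost_derivative = cost_derivative

    def fit(self, X, y):
        for _ in range(self.epochs):
            # Forward pass: run the samples through every layer in order.
            output = X
            for layer in self.layers:
                output = layer.forward(output)
            # Backward pass: start from the cost derivative and let each layer
            # update its parameters and hand its error back to its predecessor.
            error = self.cost_derivative(y, output)
            for layer in reversed(self.layers):
                error = layer.backward(error, self.learning_rate)

    def predict(self, X):
        output = X
        for layer in self.layers:
            output = layer.forward(output)
        return output
```

For example, a tiny model could be assembled with `net = Network(learning_rate=0.1, epochs=1000)`, a couple of `net.add_layer(...)` calls using the `FullyConnectedLayer` and `ActivationLayer` sketches above, then `net.use_cost_function(mse, mse_derivative)`, and finally `net.fit(X, y)` followed by `net.predict(X)`.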
Artificial neural networks are paving the way for life-changing applications to be developed for use in all sectors of the economy. Artificial intelligence platforms that are built on ANNs are disrupting the traditional ways of doing things. From translating web pages into other languages to having a virtual assistant order groceries online to conversing with chatbots to solve problems, AI platforms are simplifying transactions and making services accessible to all at negligible costs.
Artificial neural networks have been applied in all areas of operations. Email service providers use ANNs to detect and delete spam from a user’s inbox; asset managers use it to forecast the direction of a company’s stock; credit rating firms use it to improve their credit scoring methods; e-commerce platforms use it to personalize recommendations to their audience; chatbots are developed with ANNs for natural language processing; deep learning algorithms use ANN to predict the likelihood of an event; and the list of ANN incorporation goes on across multiple sectors, industries, and countries.
For a great explanation of Neural Networks, check the following video: