Week 27

What is Machine Learning?

Machine learning is the study of algorithms that learn from data; the goal is to optimize the accuracy of the model’s predictions. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

Types of ML Algorithms

Supervised Learning

Supervised learning is one of the most basic types of machine learning. In this type, the machine learning algorithm is trained on labeled data. Even though the data needs to be labeled accurately for this method to work, supervised learning is extremely powerful when used in the right circumstances. In supervised learning, the ML algorithm is given a small training dataset to work with. This training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the problem, the solution, and the data points to be dealt with. The training dataset is also very similar to the final dataset in its characteristics and provides the algorithm with the labeled parameters required for the problem. The algorithm then finds relationships between the given parameters, essentially establishing a relationship between the input variables and the output in the dataset. At the end of training, the algorithm has an idea of how the data behaves and of the relationship between the input and the output.

Unsupervised Learning

Unsupervised machine learning holds the advantage of being able to work with unlabeled data. This means that human labor is not required to make the dataset machine-readable, allowing much larger datasets to be worked on by the program. In supervised learning, the labels allow the algorithm to find the exact nature of the relationship between any two data points. Unsupervised learning has no labels to work off of, so instead it discovers hidden structures in the data: relationships between data points are perceived by the algorithm in an abstract manner, with no input required from human beings.

Reinforcement Learning

Reinforcement learning directly takes inspiration from how human beings learn from data in their lives. It features an algorithm that improves upon itself and learns from new situations using a trial-and-error method. Favorable outputs are encouraged, or ‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.

Regression

Regression models the relationship between a dependent variable and the independent variables in the data. By analyzing this relationship, we create an equation with a coefficient for each independent variable according to its effect. This equation (the so-called function) calculates our dependent variable’s predicted value (the target variable).

E.g. y = a*x1 + b*x2 + ... + k*xN + c

Note: We aim to create the optimal function, i.e. the one with the minimum distance between our function and the data (the Least Squares Method).

Machine learning, more specifically the field of predictive modeling, is primarily concerned with minimizing the error of a model or making the most accurate predictions possible at the expense of explainability. We will borrow, reuse, and steal algorithms from many fields, including statistics, and use them to these ends in applied machine learning. As such, linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables but has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.

  • The representation of linear regression is one of the simplest among all ML algorithms.

    • The representation is a linear equation that combines a specific set of input values (x); the solution is the predicted output for that set of input values (y). Both the input values (x) and the output value are numeric.

y = B0 + B1 * x1

B0 is the bias coefficient (the intercept), and B1 is the coefficient for the input variable. We use a learning technique to find a good set of coefficient values. Once found, we can plug in different input values to predict y.

SSE (RSS)

With this method, we measure how well our algorithm did at generating the function we’ll use to make the predictions. Each residual is the difference between an observed value and the value predicted by the function:

Ɛ1 = y1 - yt1

Ɛ2 = y2 - yt2

Ɛ3 = y3 - yt3

where ytn = a*xn + b, i.e. the prediction of the function generated by the algorithm. The SSE (sum of squared errors), also called the RSS (residual sum of squares), is Ɛ1² + Ɛ2² + Ɛ3² + …
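
As a small illustration, here is a sketch in NumPy of computing the residuals and their SSE; the data points and the coefficients a and b below are made up for the example.

```python
import numpy as np

# Made-up data points and a made-up fitted line y_t = a*x + b.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
a, b = 2.0, 0.1

y_t = a * x + b                 # predictions of the generated function
residuals = y - y_t             # Ɛ_i = y_i - yt_i
sse = np.sum(residuals ** 2)    # SSE / RSS: sum of squared errors

print(residuals)
print(sse)
```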

Least Squares Method

You are welcome to watch the rest of the playlist on the linked YouTube channel.

Now, try to draw the graph with the least squares method (LSM) and check your answer below!

Feature       x1  x2  x3  x4
Age (Month)    2   3   5   7
Weight (Kg)    4   4   5   7
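
As one way to check your answer, here is a small NumPy sketch that computes the least-squares slope and intercept for the table above using the closed-form formulas.

```python
import numpy as np

# Data from the table: predict weight (kg) from age (months).
age = np.array([2.0, 3.0, 5.0, 7.0])
weight = np.array([4.0, 4.0, 5.0, 7.0])

# Closed-form least squares:
# slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope * x̄
slope = np.sum((age - age.mean()) * (weight - weight.mean())) / np.sum((age - age.mean()) ** 2)
intercept = weight.mean() - slope * age.mean()

print(slope, intercept)  # approximately 0.61 and 2.41
```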

Since we have covered how linear regression works theoretically, let’s see the Python application!
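
A minimal sketch of such a Python application, using scikit-learn and the age/weight data from the table above (the original code may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Age (months) as the single input feature, weight (kg) as the target.
X = np.array([[2], [3], [5], [7]])
y = np.array([4, 4, 5, 7])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)  # B0 and B1
print(model.predict([[6]]))           # predicted weight for a 6-month-old
```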

Multilinear Regression

What is Multiple Linear Regression?

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple linear regression aims to model the relationship between explanatory (independent) and response (dependent) variables. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable.

Key Takeaways

  • Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.

  • Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory variable.

  • MLR is used extensively in econometrics and financial inference.

The MLR model takes the form y = B0 + B1X1 + B2X2 + … + BnXn + e, where:

  • y = the predicted value of the dependent variable

  • B0 = the y-intercept (the value of y when all other parameters are set to 0)

  • B1X1= the regression coefficient (B1) of the first independent variable (X1) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)

  • … = do the same for however many independent variables you are testing

  • BnXn = the regression coefficient of the last independent variable

  • e = model error (a.k.a. how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.

  • The t-statistic of the overall model.

  • The associated p-value (how likely it is that the t-statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).

Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables. Take a look at the code below.
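
A minimal scikit-learn sketch of multiple linear regression (the two-feature dataset below is made up for illustration; the original code may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two explanatory variables X1, X2 and one response y.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.1, 5.9, 11.2, 12.1, 15.8])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)             # B0
print(model.coef_)                  # B1 and B2
print(model.predict([[3.0, 3.0]]))  # prediction for X1 = 3, X2 = 3
```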

Normalisation and Standardisation

Normalisation is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. Here’s the formula for normalisation:

X’ = (X - Xmin) / (Xmax - Xmin)

  • When the value of X is the minimum value in the column, the numerator will be 0. Hence, X’ is 0

  • On the other hand, when the value of X is the maximum value in the column, the numerator is equal to the denominator and thus the value of X’ is 1

  • If the value of X is between the minimum and the maximum value, the value of X’ is between 0 and 1.

Standardisation is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Here’s the formula for standardisation:

X’ = (X - μ) / σ

where μ is the mean of the attribute values and σ is their standard deviation.

Note that in this case the values are not restricted to a particular range. Now, the big question in your mind must be: when should we use normalisation, when should we use standardisation, and how do we use them? Let’s find out!
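
A minimal sketch of both techniques with scikit-learn’s MinMaxScaler and StandardScaler (the single made-up column below is only for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up column with one large value to make the effect visible.
data = np.array([[10.0], [20.0], [30.0], [40.0], [100.0]])

# Normalisation (Min-Max scaling): rescales to the range [0, 1].
normalised = MinMaxScaler().fit_transform(data)

# Standardisation: zero mean, unit standard deviation.
standardised = StandardScaler().fit_transform(data)

print(normalised.ravel())
print(standardised.ravel())
```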

Log Transformation

Log transformation is a data transformation method that replaces each value x of a variable with log(x). The choice of the logarithm base is usually left up to the analyst and depends on the purposes of the statistical modeling. In this article, we will focus on the natural log transformation. The natural log is denoted as ln.

When our original continuous data do not follow the bell curve, we can log transform this data to make it as “normal” as possible so that the statistical analysis results from this data become more valid. In other words, the log transformation reduces or removes the skewness of our original data. The important caveat here is that the original data has to follow or approximately follow a log-normal distribution. Otherwise, the log transformation won’t work.
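
A short NumPy sketch of the natural-log transform on a made-up, right-skewed (log-normally distributed) column:

```python
import numpy as np

# Made-up right-skewed data, drawn from a log-normal distribution.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# Natural log transform (use np.log1p instead if zeros can occur).
transformed = np.log(skewed)

print(skewed.mean(), np.median(skewed))            # mean far above median: skewed
print(transformed.mean(), np.median(transformed))  # roughly equal: near-normal
```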

Ridge Regression

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.

What is (Multi)collinearity?

Multicollinearity, or collinearity, is the existence of near-linear relationships among the independent variables. For example, suppose that the three ingredients of a mixture are studied by including their percentages of the total. These variables will have the (perfect) linear relationship: P1 + P2 + P3 = 100. During regression calculations, this relationship causes a division by zero which in turn causes the calculations to be aborted. When the relationship is not exact, the division by zero does not occur and the calculations are not aborted. However, the division by a very small quantity still distorts the results. Hence, one of the first steps in a regression analysis is to determine if multicollinearity is a problem.

Why is it dangerous?

Multicollinearity can create inaccurate estimates of the regression coefficients, inflate the standard errors of the regression coefficients, deflate the partial t-tests for the regression coefficients, give false, nonsignificant p-values, and degrade the predictability of the model.

Why does Multicollinearity occur?

  1. Data collection. In this case, the data have been collected from a narrow subspace of the independent variables. The multicollinearity has been created by the sampling methodology—it does not exist in the population. Obtaining more data on an expanded range would cure this multicollinearity problem. The extreme example of this is when you try to fit a line to a single point.

  2. Physical constraints of the linear model or population. This source of multicollinearity will exist no matter what sampling technique is used. Many manufacturing or service processes have constraints on independent variables (as to their range), either physically, politically, or legally, which will create multicollinearity.

  3. Over-defined model. Here, there are more variables than observations. This situation should be avoided.

  4. Model choice or specification. This source of multicollinearity comes from using independent variables that are powers or interactions of an original set of variables. It should be noted that if the sampling subspace of independent variables is narrow, then any combination of those variables will increase the multicollinearity problem even further.

  5. Outliers. Extreme values or outliers in the X-space can cause multicollinearity as well as hide it. We call this outlier-induced multicollinearity. This should be corrected by removing the outliers before ridge regression is applied.

What is Ridge Regression?

Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity. This method performs L2 regularization. When the issue of multicollinearity occurs, least-squares estimates are unbiased but their variances are large, which results in predicted values being far away from the actual values.

The formula of Ridge Regression

β̂ridge = (X′X + λI)⁻¹ X′Y

Lambda (λ) is the penalty term; it is denoted by the alpha parameter in the ridge function. So, by changing the value of alpha, we are controlling the penalty term. The higher the value of alpha, the bigger the penalty, and therefore the more the magnitudes (weights) of the coefficients are reduced (i.e., the parameters causing the error get smaller coefficients and have less effect on the prediction).

Note: In ridge regression, the first step is to standardize the variables (both dependent and independent) by subtracting their means and dividing by standard deviations. However, for simplicity, we will assume all variables are standardized!
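
A minimal scikit-learn sketch of ridge regression on made-up data where the second feature is nearly a copy of the first (multicollinearity); the alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Made-up data: x2 is almost identical to x1, so the features are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))
y = 3 * x1 + rng.normal(scale=0.5, size=100)

for alpha in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)  # larger alpha -> stronger penalty, smaller coefficients
```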

Lasso Regression

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination. The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This type of regularization can result in sparse models with few coefficients; some coefficients can become zero and be eliminated from the model. Larger penalties result in coefficient values closer to zero, which is ideal for producing simpler models.

Note: L2 regularization (e.g. Ridge regression) doesn’t result in elimination of coefficients or sparse models. This makes the Lasso far easier to interpret than the Ridge.
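
A minimal sketch comparing Lasso and Ridge on made-up data where only the first two of eight features matter; the alpha value is arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: only features 0 and 1 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(lasso.coef_)  # irrelevant coefficients are typically driven to exactly 0 (sparse)
print(ridge.coef_)  # irrelevant coefficients shrink but stay non-zero
```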

This video also gives a proper illustration and explanation of the differences between Ridge and Lasso regression!

Polynomial Regression

Depending on what the data looks like, we can do a polynomial regression on the data to fit a polynomial equation to it.

If we try to use simple linear regression on data like that in the graph above, the regression line won’t fit very well: it is very difficult to fit a straight line to such data with a low error. Hence, we can use polynomial regression to fit a polynomial curve and achieve a minimum error (minimum cost function). The equation of the polynomial regression for that data would be:

y = θ₀ + θ₁x₁ + θ₂x₁²

The general equation of polynomial regression is:

Y = θ₀ + θ₁X + θ₂X² + … + θₘXᵐ + residual error
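
A minimal scikit-learn sketch of a degree-2 polynomial regression on made-up curved data (roughly y = 1 + 2x + 3x²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2 + rng.normal(scale=1.0, size=50)

# Expand x into [x, x²] and fit an ordinary linear regression on it.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)           # estimates of θ₀, θ₁, θ₂
print(model.predict(poly.transform([[2.0]])))  # prediction at x = 2
```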

Advantages of using Polynomial Regression

  • A polynomial can provide a good approximation of the relationship between the dependent and independent variables.

  • A broad range of functions can be fit with it.

  • Polynomials can fit a wide range of curvature.

Disadvantages of using Polynomial Regression

  • The presence of one or two outliers in the data can seriously affect the results of the nonlinear analysis.

  • Polynomial models are very sensitive to outliers.

  • In addition, there are unfortunately fewer model validation tools for the detection of outliers in nonlinear regression than there are for linear regression.

Categorical Variable Transformations

In many machine learning or data science activities, the dataset might contain text or categorical values (basically non-numerical values), for example a color feature with values like red, orange, blue, white, etc., or a meal plan with values like breakfast, lunch, snacks, dinner, tea, etc. A few algorithms such as CatBoost and decision trees can handle categorical values very well, but most algorithms expect numerical values to achieve state-of-the-art results.

Over your learning curve in AI and machine learning, one thing you will notice is that most algorithms work better with numerical inputs. Therefore, the main challenge faced by an analyst is to convert text/categorical data into numerical data while still enabling an algorithm/model to make sense of it. Neural networks, which are the basis of deep learning, expect input values to be numerical.

There are many ways to convert categorical values into numerical values. Each approach has its own trade-offs and impact on the feature set.

One-hot Encoding

One-hot encoding is an approach to convert one categorical column to multiple binary (0 or 1) columns as many as the number of distinct levels in the original column. If there are four levels on the categorical variable, one-hot encoding will create four new columns, each of which has 0 or 1 and represents if the original column has the level.

Here is the code:
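
For illustration, a minimal pandas sketch of one-hot encoding on a made-up color column (the original code may differ):

```python
import pandas as pd

# Made-up categorical column with four distinct levels.
df = pd.DataFrame({"color": ["red", "orange", "blue", "white", "red"]})

# One new 0/1 column per distinct level of 'color'.
one_hot = pd.get_dummies(df, columns=["color"])
print(one_hot)
```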

Label Encoding

Label encoding is an approach to convert the levels to integers e.g. levels: [‘A’, ‘B’, ‘C’, …] to integers: [0, 1, 2, …].

This approach is not appropriate for most machine learning algorithms, because the magnitude of the transformed value has nothing to do with the target variable; the exception is decision-tree based models, which may be able to split the transformed numeric column multiple times across layered tree-node splits. Also, in case the categorical variable has an ‘ordinal’ nature, e.g. Cold < Warm < Hot < Very Hot, label encoding can potentially work better than other encoding techniques.

The same code link as in the One-hot Encoding section applies here.
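
For illustration, a minimal pandas sketch of label encoding for an ordinal variable (the temperature column and its ordering are made up):

```python
import pandas as pd

# Made-up ordinal column: Cold < Warm < Hot < Very Hot.
df = pd.DataFrame({"temp": ["Cold", "Hot", "Warm", "Very Hot", "Cold"]})

# Map the levels to integers that respect the ordinal ordering.
order = {"Cold": 0, "Warm": 1, "Hot": 2, "Very Hot": 3}
df["temp_encoded"] = df["temp"].map(order)
print(df)
```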

Binary Encoding

Binary encoding is an approach that turns a categorical column into multiple binary columns while minimizing the number of new columns. First, turn the categorical values into integers in some order (e.g. alphabetical order, or order of appearance from the top row). Next, write each integer in binary digits, such that 1 becomes 1, 2 becomes 10, 5 becomes 101, etc. Finally, split the binary digits into separate columns, each holding a single digit.

Binary encoding can reduce the number of new columns to the order of log₂(number of levels). However, different original levels can end up with a 1 in the same column, which is not a good thing: the model may treat those levels as sharing some property, while in reality they share a 1 in the same column only for a technical (encoding) reason.

The same code link as in the One-hot Encoding section applies here.
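
For illustration, a minimal pandas sketch that implements the three steps of binary encoding by hand (the city column is made up):

```python
import pandas as pd

# Made-up categorical column with five distinct levels.
df = pd.DataFrame({"city": ["A", "B", "C", "D", "E", "B", "D"]})

# Step 1: map each level to an integer in order of appearance, starting at 1.
codes = pd.factorize(df["city"])[0] + 1

# Step 2: write each integer as a binary string, zero-padded to a fixed width.
width = len(format(int(codes.max()), "b"))
binary_strings = [format(int(c), "b").zfill(width) for c in codes]

# Step 3: split the binary digits into separate 0/1 columns.
for i in range(width):
    df[f"city_bin_{i}"] = [int(s[i]) for s in binary_strings]

print(df)
```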

Continuous Variable to Categorical Variable
