maths-stats
Last updated
A common question we all hear:
Even though the question sounds simple, there is no simple answer to it. Usually, we say that basic descriptive and inferential statistics are enough to get started.
But once you have covered the basic concepts in machine learning, you will need to learn some more math. You need it to understand how these algorithms work, what their limitations are, and whether they make any underlying assumptions. There are many areas you could study, including algebra, calculus, statistics and 3-D geometry.
If you get confused and ask experts what you should learn at this stage, most of them would suggest that you go ahead with Linear Algebra.
But, the problem does not stop there. The next challenge is to figure out how to learn Linear Algebra.
I would like to present 4 scenarios to showcase why learning Linear Algebra is important, if you are learning Data Science and Machine Learning.
What do you see when you look at the image above? You most likely said flower and leaves, which is not too difficult. But if I ask you to write that logic so that a computer can do the same, it will be a very difficult task (to say the least).
You were able to identify the flower because the human brain has gone through millions of years of evolution. We do not understand what goes on in the background that lets us tell whether the color in the picture is red or black. We have somehow trained our brains to perform this task automatically.
Making a computer do the same is not easy, and it is an active area of research in Machine Learning and Computer Science in general. But before we work on identifying attributes in an image, let us ponder a particular question: how does a machine store this image?
You probably know that today's computers are designed to process only 0s and 1s. So how can an image such as the one above, with multiple attributes like color, be stored in a computer? This is achieved by storing the pixel intensities in a construct called a matrix. This matrix can then be processed to identify colors and other attributes.
So any operation which you want to perform on this image would likely use Linear Algebra and matrices at the back end.
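To make the idea concrete, here is a minimal sketch of an image stored as a matrix of pixel intensities. The 3x3 values are made up for illustration (0 is assumed to mean black and 255 white), and a plain nested list stands in for whatever matrix type a real image library would use.

```python
# A tiny 3x3 grayscale "image": each entry is a pixel intensity.
# Assumption: 0 = black, 255 = white; the values are invented for illustration.
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 100],
]

rows, cols = len(image), len(image[0])

# Inverting the image is an element-wise matrix operation:
# every pixel p becomes 255 - p.
inverted = [[255 - pixel for pixel in row] for row in image]

print(rows, cols)
print(inverted)
```

Even this simple "invert colors" operation is nothing more than arithmetic applied to every element of the matrix.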
If you are somewhat familiar with the Data Science domain, you might have heard the word “XGBOOST”, an algorithm employed very frequently by winners of Data Science competitions. It stores numeric data in the form of a matrix to give predictions. This representation helps XGBOOST process data efficiently. Moreover, not just XGBOOST but various other algorithms use matrices to store and process data.
Deep Learning, the new buzzword in town, employs matrices to store inputs such as images, speech or text to give state-of-the-art solutions to these problems. Weights learned by a Neural Network are also stored in matrices. Below is a graphical representation of weights stored in a matrix.
Another active area of research in Machine Learning is dealing with text, where the most common techniques include Bag of Words and the Term Document Matrix. All these techniques work in a very similar manner: they count the words (or something similar) in each document and store these frequency counts in matrix form to perform tasks like semantic analysis, language translation and language generation.
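As a rough illustration of how word counts end up in a matrix, the sketch below builds a small count matrix from two made-up sentences. The corpus and variable names are hypothetical; here each row is a document and each column a word, so transposing it would give the term-document layout.

```python
from collections import Counter

# Hypothetical toy corpus: each string is one "document".
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: every distinct word, sorted so the column order is reproducible.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# One row per document, one column per vocabulary word, each entry a word count.
count_matrix = []
for doc in documents:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Real text pipelines usually add steps such as lowercasing, punctuation removal and stop-word filtering, which are left out here to keep the sketch short.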
So, now you would understand the importance of Linear Algebra in machine learning. We have seen image, text or any data, in general, employing matrices to store and process data. This should be motivation enough to go through the material below to get you started on Linear Algebra. This is a relatively long guide, but it builds Linear Algebra from the ground up.
Let’s start with a simple problem. Suppose that the price of 1 ball and 2 bats is 100 units, and the price of 2 balls and 1 bat is also 100 units. We need to find the price of a ball and the price of a bat.
Suppose the price of a bat is Rs ‘x’ and the price of a ball is Rs ‘y’. The values of ‘x’ and ‘y’ can be anything depending on the situation, i.e. ‘x’ and ‘y’ are variables.
Let’s translate the first condition into mathematical form:
2x + y = 100 ...........(1)
Similarly, for the second condition:
x + 2y = 100 ..............(2)
Now, to find the prices of the bat and the ball, we need the values of ‘x’ and ‘y’ that satisfy both equations. The basic problem of linear algebra is to find these values of ‘x’ and ‘y’, i.e. the solution of a set of linear equations.
Broadly speaking, in linear algebra data is represented in the form of linear equations. These linear equations are in turn represented in the form of matrices and vectors.
The number of variables as well as the number of equations may vary depending upon the condition, but the representation is in form of matrices and vectors.
It is usually helpful to visualize data problems. Let us see if that helps in this case.
Linear equations represent flat objects. We will start with the simplest one to understand, i.e. a line. The line corresponding to an equation is the set of all the points which satisfy that equation. For example,
the points (50,0), (0,100), (100/3,100/3) and (30,40) satisfy our equation (1), so they lie on the line corresponding to equation (1). Similarly, (0,50), (100,0) and (100/3,100/3) are some of the points that satisfy equation (2).
Now in this situation, we want both of the conditions to be satisfied i.e. the point which lies on both the lines. Intuitively, we want to find the intersection point of both the lines as shown in the figure below.
Let’s solve the problem by elementary algebraic operations like addition, subtraction and substitution.
2x + y = 100 .............(1)
x + 2y = 100 ..........(2)
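From equation (1):
y = 100 - 2x
Put this value of y in equation (2):
x + 2*(100 - 2x) = 100 ......(3)
Since equation (3) is an equation in the single variable x, it can be solved for x and subsequently for y. Here it gives 3x = 100, so x = 100/3 and y = 100 - 2*(100/3) = 100/3, which is the point (100/3, 100/3) that lies on both lines.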
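If you want to check the bat and ball answer by machine, the sketch below solves equations (1) and (2) exactly. It uses Cramer's rule, which is a different technique from the substitution used in the text, and the function name `solve_2x2` is just a hypothetical helper for this example.

```python
from fractions import Fraction

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 using Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("no unique solution (lines are parallel or identical)")
    # Fractions keep the answer exact instead of rounding to a float.
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y

# Equation (1): 2x + y = 100, equation (2): x + 2y = 100
x, y = solve_2x2(2, 1, 100, 1, 2, 100)
print(x, y)
```

Both prices come out as the same fraction, 100/3, which is the intersection point of the two lines.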
That looks simple – let’s go one step further and explore.
Now, suppose you are given a set of three conditions with three variables each as given below and asked to find the values of all the variables. Let’s solve the problem and see what happens.
x+y+z=1.......(4)
2x+y=1......(5)
5x+3y+2z=4.......(6)
From equation (4) we get,
z=1-x-y....(7)
Substituting the value of z in equation (6), we get:
5x+3y+2(1-x-y)=4
3x+y=2.....(8)
Now we can solve equations (8) and (5) as a case of two variables to find the values of ‘x’ and ‘y’, as in the problem of bat and ball above. Once we know ‘x’ and ‘y’, we can use (7) to find the value of ‘z’.
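The same elimination idea can be automated. Below is a minimal sketch of Gauss-Jordan elimination applied to equations (4), (5) and (6); the function name `solve_linear_system` is hypothetical, and exact fractions are used so that no rounding creeps in. It is meant as an illustration, not as a replacement for a numerical library.

```python
from fractions import Fraction

def solve_linear_system(coefficients, constants):
    """Solve a square linear system by Gauss-Jordan elimination with exact fractions."""
    n = len(coefficients)
    # Build the augmented matrix [A | b] with exact rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(coefficients, constants)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # After elimination the last column holds the solution.
    return [row[-1] for row in aug]

# x + y + z = 1, 2x + y = 1, 5x + 3y + 2z = 4
x, y, z = solve_linear_system(
    [[1, 1, 1], [2, 1, 0], [5, 3, 2]],
    [1, 1, 4],
)
print(x, y, z)
```

For this system the result is x = 1, y = -1, z = 1, which you can confirm by substituting back into equations (4) to (6).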
As you can see, adding an extra variable has noticeably increased the effort needed to find the solution. Now imagine having 10 variables and 10 equations. Solving 10 equations simultaneously by hand is tedious and time consuming. Now dive into data science.
We have millions of data points in a real data set. It is going to be a nightmare to reach solutions using the approach mentioned above. And imagine if we have to do it again and again. It would take ages to solve this problem. And what if I tell you that this is just one part of the battle? So, what should we do? Should we quit and let it go? Definitely not. Then what?
Matrices are used to solve large sets of linear equations. But before we go further and take a look at matrices, let’s visualize the physical meaning of our problem. Give a little bit of thought to the next topic. It directly relates to the usage of matrices.
A linear equation in 3 variables represents the set of all points whose coordinates satisfy the equation. Can you figure out the physical object represented by such an equation? Try to think of 2 variables at a time in any equation and then add the third one. You should figure out that it represents a three-dimensional analogue of a line.
Basically, a linear equation in three variables represents a plane. More technically, a plane is a flat geometric object which extends up to infinity.
As in the case of lines, finding the solution of a set of linear equations in 3 variables means finding the intersection of the corresponding planes. Can you imagine in how many ways a set of three planes can intersect? Let me help you out. There are 4 possible cases:
No intersection at all.
Planes intersect in a line.
They can intersect in a plane.
All the three planes intersect at a point.
Can you imagine the number of solutions in each case? Try doing this. Here is an aid picked from Wikipedia to help you visualize.
So, what was the point of having you visualize all the graphs above?
Normal humans like us, and even most super mathematicians, can only visualize things in 3 dimensions; visualizing things in 4 (or 10000) dimensions is practically impossible for mortals. So how do mathematicians deal with higher dimensional data so efficiently? They have tricks up their sleeves, and matrices are one such trick for dealing with higher dimensional data.
Now let’s proceed with our main focus i.e. Matrix.
A matrix is a way of writing similar things together so that we can handle and manipulate them easily. In Data Science, it is generally used to store information such as the weights of an Artificial Neural Network while training various algorithms. You will be able to understand my point by the end of this article.
Technically, a matrix is a 2-D array of numbers (as far as Data Science is concerned). For example look at the matrix A below.
1 2 3
4 5 6
7 8 9
Generally, rows are denoted by ‘i’ and columns by ‘j’. Each element is indexed by its ‘i’th row and ‘j’th column. We denote the matrix by a letter, e.g. A, and its elements by A(ij).
In the above matrix:
A12 = 2
To reach this result, go along the first row to the second column.
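In code, the same matrix can be written as a list of rows. One thing to watch out for: Python lists are indexed from 0, while the A(ij) notation above counts from 1. The small helper `element` below is a hypothetical convenience that hides this offset.

```python
# Matrix A from the text, stored as a list of rows.
A = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

def element(matrix, i, j):
    """Return the element in row i, column j, using the 1-based convention of the text."""
    return matrix[i - 1][j - 1]

a12 = element(A, 1, 2)  # first row, second column
print(a12)
```

Written directly with Python's 0-based indexing, A12 is simply `A[0][1]`.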
Order of matrix – If a matrix has 3 rows and 4 columns, the order of the matrix is 3*4, i.e. rows*columns.
Square matrix – The matrix in which the number of rows is equal to the number of columns.
Diagonal matrix – A matrix with all the non-diagonal elements equal to 0 is called a diagonal matrix.
Upper triangular matrix – Square matrix with all the elements below diagonal equal to 0.
Lower triangular matrix – Square matrix with all the elements above the diagonal equal to 0.
Scalar matrix – Square matrix with all the diagonal elements equal to some constant k and all the non-diagonal elements equal to 0.
Identity matrix – Square matrix with all the diagonal elements equal to 1 and all the non-diagonal elements equal to 0.
Column matrix – The matrix which consists of only 1 column. Sometimes, it is used to represent a vector.
Row matrix – A matrix consisting of only 1 row.
Trace – It is the sum of all the diagonal elements of a square matrix.
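The definitions above translate almost word for word into code. The following is a minimal sketch using nested lists; all function names are hypothetical, and a diagonal matrix is assumed to be square here, which is the usual convention.

```python
def is_square(m):
    """True if the matrix has as many rows as columns."""
    return all(len(row) == len(m) for row in m)

def is_upper_triangular(m):
    """Square matrix whose entries below the diagonal are all 0."""
    n = len(m)
    return is_square(m) and all(m[i][j] == 0 for i in range(n) for j in range(i))

def is_lower_triangular(m):
    """Square matrix whose entries above the diagonal are all 0."""
    n = len(m)
    return is_square(m) and all(m[i][j] == 0 for i in range(n) for j in range(i + 1, n))

def is_diagonal(m):
    """Assumed square: every entry off the diagonal is 0."""
    return is_upper_triangular(m) and is_lower_triangular(m)

def is_identity(m):
    """Diagonal matrix whose diagonal entries are all 1."""
    return is_diagonal(m) and all(m[i][i] == 1 for i in range(len(m)))

def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    if not is_square(m):
        raise ValueError("trace is only defined for square matrices")
    return sum(m[i][i] for i in range(len(m)))

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(trace(A), is_diagonal(A), is_identity(I3))
```

For the matrix A used earlier, the trace is 1 + 5 + 9 = 15, and A is neither diagonal nor triangular.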
Please watch the video series "Maths for Data Science" to get a better understanding of the topic.
Please take a look at this slide deck for further information.
When we talk about measuring the central tendency of a variable, the 3 M’s come into the picture:
1. Mode
2. Median
3. Mean
If your variable of interest is measured at the nominal or ordinal (categorical) level, then the mode is the most often used measure of central tendency. Finding the mode is easy: it is the value that occurs most frequently. In other words, the mode is the most common outcome, i.e. the name of the category that occurs most often. A variable can have more than one mode.
Example:
Here you have two modes.
If you still don’t understand and want to calculate the mode step by step, then please follow the link: http://www.purplemath.com/modules/meanmode.htm
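Here is a small sketch of finding the mode in Python. The list of colours is made up for illustration, and it is chosen so that two categories tie for the highest count. `statistics.multimode` (available from Python 3.8) returns every value that shares the highest frequency.

```python
from collections import Counter
from statistics import multimode

# Hypothetical categorical data with two equally common categories.
colours = ["red", "blue", "red", "green", "blue"]

# Count how often each category occurs.
frequencies = Counter(colours)

# multimode returns all categories that share the highest count.
modes = multimode(colours)

print(frequencies)
print(modes)
```

Because "red" and "blue" both occur twice, this variable has two modes.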
The second measure of central tendency is the median. The median is nothing more than the middle value of your observations when they are ordered from the smallest to the largest.
It involves two steps:
1. Order your cases from smallest to largest
2. Find the middle value
• If you have an odd number of cases, finding the middle value is easy. Say you have 5 cases: after ordering, the 3rd position is always the middle value.
• If you have an even number of cases (say 6 cases), there is no single middle value. Then how do we calculate the median? Well, we just take the average of the two middle values.
Example:
The third measure of central tendency is the most often used one, and also the one you most probably already know quite well: the mean. The mean is the sum of all the values divided by the number of observations. It is nothing but the average value.
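The median and mean described above can be sketched in a few lines. The two small data sets are invented for illustration, one with an odd and one with an even number of cases; in practice you could also use `statistics.median` and `statistics.mean` from the standard library.

```python
def median(values):
    """Order the values, then take the middle one (or average the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mean(values):
    """Sum of all the values divided by the number of observations."""
    return sum(values) / len(values)

odd_cases = [3, 1, 4, 1, 5]        # hypothetical data, 5 cases
even_cases = [7, 1, 3, 5, 9, 11]   # hypothetical data, 6 cases

print(median(odd_cases), median(even_cases), mean(odd_cases))
```

With 5 cases the median is the 3rd ordered value, and with 6 cases it is the average of the 3rd and 4th ordered values.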
• If your data is nominal, it is impossible to calculate a mean or median, so go for the mode. For ordinal data the categories can be ordered, so a median can also be found, but the mean is still not meaningful.
• If your data is quantitative, go for the mean or median. If your data has some influential outliers or is highly skewed, the median is the better measure of central tendency. Otherwise go for the mean.
Suppose you are comparing two groups. You have already calculated the central tendency of your data, i.e. the mean, median and mode, for both groups. Sometimes it may happen that the mean, median and mode are the same for both groups. Let’s take a look at the example below:
For both teams, Mode = 14.1, Median = 15 and Mean = 15.
This indicates that adequately describing a distribution sometimes needs more information than the measures of central tendency.
In this situation, measures of variability come into the picture. They give a good indication of how the values in a distribution are spread out:
• Range
• Interquartile range
• Box plot
The simplest measure of variability is the range. It is the difference between the highest and the lowest value.
For the above example, the range will be:
Range(team1) = 19.3 – 10.8 = 8.5
Range(team2) = 27.7-0 = 27.7
As the range takes only the extreme values into account, it may not give a good picture of the variability. In this case, you can go for another measure of variability called the interquartile range (IQR).
The interquartile range is often a better measure of dispersion than the range because it leaves out the extreme values.
It is based on dividing the ordered distribution into four equal parts using cut points called quartiles.
The 1st quartile (Q1) is the value below which the first 25% of the observations fall, and the 3rd quartile (Q3) is the value below which 75% fall. The 2nd quartile (Q2) divides the distribution into two equal parts of 50%, so it is the same as the median.
The interquartile range is the distance between the third and the first quartile, or, in other words, IQR equals Q3 minus Q1: IQR = Q3- Q1
Step 1: Order the values from low to high.
Step 2: Find the median, in other words Q2.
Step 3: Find Q1 by taking the median of the values to the left of Q2.
Step 4: Similarly, find Q3 by taking the median of the values to the right of Q2.
Step 5: Subtract Q1 from Q3 to get the IQR.
Example:
Consider the example below to get a clear idea.
Consider another example to get a better understanding.
Consider the following numbers: 1, 3, 4, 5, 5, 6, 7, 11.
Q1 is the middle value in the first half of the data set. Since there is an even number of data points in the first half of the data set, the middle value is the average of the two middle values; that is, Q1 = (3 + 4)/2 or Q1 = 3.5. Q3 is the middle value in the second half of the data set. Again, since the second half of the data set has an even number of observations, the middle value is the average of the two middle values; that is, Q3 = (6 + 7)/2 or Q3 = 6.5. The interquartile range is Q3 minus Q1, so IQR = 6.5 – 3.5 = 3.
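The worked example above can be reproduced with a short sketch. It follows the "median of each half" method described in the steps; the function name `quartiles` is hypothetical, and for an odd number of values the middle value is assumed to be left out of both halves. Other tools may use slightly different quartile conventions and give slightly different numbers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the median itself belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [1, 3, 4, 5, 5, 6, 7, 11]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

This gives Q1 = 3.5, Q3 = 6.5 and IQR = 3, matching the calculation by hand.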
• The main advantage of the IQR is that it is not affected by outliers, because it doesn’t take into account observations below Q1 or above Q3.
• It might still be useful to look for possible outliers in your study.
• As a rule of thumb, observations can be qualified as outliers when they lie more than 1.5 IQR below the first quartile or more than 1.5 IQR above the third quartile.
Lower limit = Q1 - 1.5 * IQR
Upper limit = Q3 + 1.5 * IQR
Observations below the lower limit or above the upper limit are flagged as outliers.
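The 1.5 IQR rule of thumb can be sketched as follows. The quartiles Q1 = 3.5 and Q3 = 6.5 are taken from the example with the numbers 1, 3, 4, 5, 5, 6, 7, 11; the function names and the extra test value 20 are hypothetical.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, lower, upper):
    """A value is flagged only if it lies strictly outside the limits."""
    return value < lower or value > upper

data = [1, 3, 4, 5, 5, 6, 7, 11]
lower, upper = outlier_limits(3.5, 6.5)

outliers = [v for v in data if is_outlier(v, lower, upper)]
print(lower, upper, outliers)

# A hypothetical extra observation checked against the same limits.
print(is_outlier(20, lower, upper))
```

For this data the limits are -1 and 11, so none of the eight values is flagged, while a value such as 20 would be.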
The box plot is a graph that is mainly used when you are describing the center and variability of your data. It is also useful for detecting outliers in the data.
Carefully observe the first IQR example above when it is plotted as a box plot.
There are two other measures of variability that statisticians use very often:
1. Variance
2. Standard deviation
They are popular because variance and standard deviation consider all the values of a variable when calculating the variability of your data.
There are two types of variance and standard deviation, one for a sample and one for a population. Their formulas are given first; the difference between a sample and a population is discussed further below.
Here are the formulas for the sample and population variance and standard deviation. There is a slight difference between them, so observe them carefully.
Where:
• X is an individual value
• N is the size of the population
• x̄ is the mean
Calculate the mean x̄.
Subtract the mean from each observation: X - x̄
Square each of the resulting differences: (X - x̄)^2
Add these squared results together.
Divide this total by the number of observations N (in the case of a population) to get the population variance σ². If you are calculating the sample variance s², divide by N-1 instead.
Take the positive square root of the variance to get the standard deviation (σ for a population, s for a sample).
Here, N = 11 and N-1 = 10
Mean (x̄) = 15
Sample variance (s²) = 639.74/10 = 63.97
Population variance (σ²) = 639.74/11 = 58.16
s = 8.00
σ = 7.63
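The calculation steps can be sketched in code as follows. The data set is made up for illustration (it is not the data behind the numbers above), and the function name `variance` with its `sample` flag is a hypothetical helper. The standard library offers the same results through `statistics.variance` and `statistics.pvariance`.

```python
from math import sqrt

def variance(values, sample=True):
    """Follow the steps: mean, differences, squares, sum, then divide."""
    n = len(values)
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    # Divide by n - 1 for a sample and by n for a whole population.
    divisor = n - 1 if sample else n
    return sum(squared_differences) / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

population_variance = variance(data, sample=False)
sample_variance = variance(data, sample=True)

# The standard deviation is the positive square root of the variance.
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

For this data the sum of squared differences is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, which is slightly larger.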
If the variance is high, you have larger variability in your dataset. In other words, the values are more spread out around the mean.
Standard deviation roughly represents the typical distance of an observation from the mean. The larger the standard deviation, the larger the variability of the data.
The standard deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek letter sigma) for the population standard deviation and s for the sample standard deviation. It is the square root of the variance.
The primary task of inferential statistics (or estimating or forecasting) is drawing conclusions about a whole group by using only an incomplete sample of data.
In statistics, it is very important to distinguish between population and sample. A population is defined as all members (e.g. occurrences, prices, annual returns) of a specified group. The population is the whole group.
A sample is a part of a population that is used to describe the characteristics (e.g. mean or standard deviation) of the whole population. The size of a sample can be less than 1%, or 10%, or 60% of the population, but it is never the whole population. Because a sample and a population are not the same thing, there is a slight difference in their formulas.
The differences from the mean are squared to get rid of negatives, so that negative and positive differences don’t cancel each other when added together. For example:
+5 -5 = 0
The variance of a variable describes how much the values are spread. The covariance is a measure that tells the amount of dependency between two variables.
A positive covariance means that the values of the first variable tend to be large when the values of the second variable are also large. A negative covariance means the opposite: large values of one variable are associated with small values of the other.
The covariance value depends on the scale of the variables, so it is hard to interpret on its own. The correlation coefficient is easier to interpret: it is just the covariance normalized by the standard deviations of the two variables, so it always lies between -1 and 1.
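A minimal sketch of both quantities is shown below. It uses the sample versions (dividing by n - 1), which is an assumption since the text does not say which version is meant; the data and function names are invented for illustration.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: average product of the deviations from each mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance normalized by the two standard deviations."""
    # The covariance of a variable with itself is its variance.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # grows with xs
ys_down = [10, 8, 6, 4, 2]   # shrinks as xs grows

print(covariance(xs, ys_up), correlation(xs, ys_up))
print(covariance(xs, ys_down), correlation(xs, ys_down))
```

The covariance changes if you rescale the data (for example, measuring ys in different units), while the correlation stays at 1 or -1 for these perfectly linear relationships.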
For data scientists, checking correlations is an important part of the exploratory data analysis process. This analysis is one of the methods used to decide which features affect the target variable the most, and in turn, get used in predicting this target variable. In other words, it’s a commonly-used method for feature selection in machine learning.
And because visualization is generally easier to understand than reading tabular data, heat maps are typically used to visualize correlation matrices.
The two most discussed scaling methods are Normalization and Standardization. Normalization typically means re-scaling the values into a range of [0,1]. Standardization typically means re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
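Both scaling methods are easy to sketch. The data is made up, the function names are hypothetical, and the standardization below divides by the population standard deviation, which is one common choice rather than the only one.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling: re-scale the values into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot normalize values that are all identical")
    return [(v - lowest) / (highest - lowest) for v in values]

def standardize(values):
    """Z-score scaling: re-scale to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize values that are all identical")
    return [(v - mu) / sigma for v in values]

data = [10, 20, 30, 40, 50]  # hypothetical feature values

normalized = normalize(data)
standardized = standardize(data)

print(normalized)
print(standardized)
```

After normalization the smallest value becomes 0 and the largest becomes 1; after standardization the values are centred on 0 with unit spread.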
References:
https://www.analyticsvidhya.com/blog/2017/05/comprehensive-guide-to-linear-algebra/
https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-data-statistics/mean-and-median/v/mean-median-and-mode