maths-stats

Data and Data Types

Data types are an important concept in statistics that you need to understand in order to apply statistical measurements correctly to your data and draw correct conclusions from it. This section introduces the different data types you need to know to do proper exploratory data analysis (EDA), which is one of the most underestimated parts of a machine learning project.

Introduction to Data Types

Having a good understanding of the different data types, also called measurement scales, is a crucial prerequisite for doing Exploratory Data Analysis (EDA), since you can use certain statistical measurements only for specific data types.

You also need to know which data type you are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables. We will discuss the main types of variables and look at an example for each. We will sometimes refer to them as measurement scales.

Categorical Data

Categorical data represents characteristics, such as a person's gender, native language, and so on. Categorical data can also take on numerical values (for example, 1 for female and 0 for male), but note that those numbers have no mathematical meaning.

Nominal Data

Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as "labels". Nominal data has no order, so if you changed the order of its values, the meaning would not change. Typical examples of nominal features are a person's marital status and native language.

A feature that describes whether a person is married is called "dichotomous", a type of nominal scale that contains only two categories.

Ordinal Data

Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters. A typical example is education level, with values such as Elementary, High School, and College.

Note that the difference between Elementary and High School is not the same as the difference between High School and College. This is the main limitation of ordinal data: the differences between the values are not really known. Because of that, ordinal scales are usually used to measure non-numeric features like happiness, customer satisfaction, and so on.

Numerical Data

Discrete Data

We speak of discrete data if its values are distinct and separate. In other words: we speak of discrete data if the data can only take on certain values. This type of data can't be measured, but it can be counted. It represents information that can be sorted into distinct categories or counts. An example is the number of heads in 100 coin flips.

You can check whether you are dealing with discrete data by asking two questions: can you count it, and can it be divided into smaller and smaller parts? If you can count it but cannot meaningfully subdivide it, the data is discrete.

Continuous Data

Continuous data represents measurements; its values can't be counted, but they can be measured. An example would be the height of a person, which you can describe using intervals on the real number line.

Interval Data

Interval values represent ordered units with a known, equal difference between them. We therefore speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. An example would be a feature containing the temperature of a given place (in degrees Celsius or Fahrenheit).

The problem with interval data is that it doesn't have a "true zero". With regard to our example, that means there is no such thing as "no temperature". With interval data we can add and subtract, but we cannot multiply, divide, or calculate ratios. Because there is no true zero, a lot of descriptive and inferential statistics can't be applied.

Ratio Data

Ratio values are also ordered units that have the same difference. Ratio values are the same as interval values, with the difference that they do have an absolute zero. Good examples are height, weight, length etc.

Why Are Data Types Important?

Data types are an important concept because statistical methods can only be used with certain data types. You have to analyze continuous data differently than categorical data, otherwise the analysis will be wrong. Knowing the types of data you are dealing with therefore enables you to choose the correct method of analysis.

We will now go over every data type again but this time in regards to what statistical methods can be applied. To understand properly what we will now discuss, you have to understand the basics of descriptive statistics.

Statistical Methods

Nominal Data

When you are dealing with nominal data, you collect information through:

Frequencies: The Frequency is the rate at which something occurs over a period of time or within a data set.

Proportion: You can easily calculate the proportion by dividing the frequency by the total number of events. (e.g how often something happened divided by how often it could happen)

Visualisation Methods: To visualise nominal data you can use a pie chart or a bar chart.

In data science, you can use one-hot encoding to transform nominal data into numeric features.
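
As a minimal sketch (assuming pandas is available; the 'language' column is purely illustrative):

import pandas as pd

# A hypothetical nominal feature
df = pd.DataFrame({'language': ['English', 'French', 'German', 'English']})

# One-hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(df['language'], prefix='language')
print(one_hot)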

Ordinal Data

When you are dealing with ordinal data, you can use the same methods as with nominal data, but you also have access to some additional tools. You can summarise your ordinal data with frequencies, proportions, and percentages, and you can visualise it with pie and bar charts. Additionally, you can use percentiles, the median, the mode, and the interquartile range to summarise your data.

In data science, you can use label encoding to transform ordinal data into a numeric feature.
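
A minimal sketch of label encoding for an ordinal feature (the education levels follow the earlier example; the mapping itself is an assumption you choose):

import pandas as pd

# Ordinal feature where the category order carries meaning
df = pd.DataFrame({'education': ['High School', 'Elementary', 'College']})

# Label encoding that preserves the natural order
order = {'Elementary': 0, 'High School': 1, 'College': 2}
df['education_encoded'] = df['education'].map(order)
print(df)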

Continuous Data

When you are dealing with continuous data, you have the widest range of methods available to describe your data. You can summarise it using percentiles, the median, the interquartile range, the mean, the mode, the standard deviation, and the range.

Visualisation Methods: To visualise continuous data, you can use a histogram or a box plot. With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution. Note that a histogram is not well suited to spotting outliers, which is why we also use box plots.
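
As a small sketch with matplotlib (the data are randomly generated purely to show the two plot types):

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=170, scale=10, size=500)  # illustrative continuous data

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(values, bins=30)   # central tendency, variability, modality
ax2.boxplot(values)         # median, interquartile range, outliers
plt.show()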

Normal Distribution

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.

The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.

Below you’ll learn how to use the normal distribution, about its parameters, and how to calculate Z-scores to standardize your data and find probabilities.

Example of Normally Distributed Data: Heights

Height data are approximately normally distributed. The distribution in this example fits real data collected from 14-year-old girls during a study.

The distribution of heights follows the typical pattern for all normal distributions. Most girls are close to the average (1.512 meters). Small differences between an individual's height and the mean occur more frequently than substantial deviations from the mean. The standard deviation is 0.0741 m, which indicates the typical distance that individual girls tend to fall from the mean height.

The distribution is symmetric. The number of girls shorter than average equals the number of girls taller than average. In both tails of the distribution, extremely short girls occur as infrequently as extremely tall girls.
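
As a quick sketch with scipy (the mean of 1.512 m and standard deviation of 0.0741 m are the values quoted above):

import scipy.stats as stats

heights = stats.norm(loc=1.512, scale=0.0741)

# Proportion of girls within one standard deviation of the mean
within_one_sd = heights.cdf(1.512 + 0.0741) - heights.cdf(1.512 - 0.0741)
print(within_one_sd)  # ~0.683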

Parameters of the Normal Distribution

As with any probability distribution, the parameters for the normal distribution define its shape and probabilities entirely. The normal distribution has two parameters, the mean and standard deviation. The normal distribution does not have just one form. Instead, the shape changes based on the parameter values, as shown in the graphs below.

Mean

The mean is the central tendency of the distribution. It defines the location of the peak for normal distributions. Most values cluster around the mean. On a graph, changing the mean shifts the entire curve left or right on the X-axis.

Standard deviation

The standard deviation is a measure of variability. It defines the width of the normal distribution. The standard deviation determines how far away from the mean the values tend to fall. It represents the typical distance between the observations and the average.

On a graph, changing the standard deviation either tightens or spreads out the width of the distribution along the X-axis. Larger standard deviations produce distributions that are more spread out.

When you have narrow distributions, the probabilities are higher that values won’t fall far from the mean. As you increase the spread of the distribution, the likelihood that observations will be further away from the mean also increases.

Population parameters versus sample estimates

The mean and standard deviation are parameter values that apply to entire populations. For the normal distribution, statisticians signify the parameters by using the Greek symbol μ (mu) for the population mean and σ (sigma) for the population standard deviation.

Unfortunately, population parameters are usually unknown because it’s generally impossible to measure an entire population. However, you can use random samples to calculate estimates of these parameters. Statisticians represent sample estimates of these parameters using x̅ for the sample mean and s for the sample standard deviation.

In statistics, a population is the complete set of all objects or people of interest. Typically, studies define their population of interest at the outset. Populations can have a finite but potentially very large size, for example all valves produced by a specific manufacturing plant, all adult females in the United States, or all smokers.

Populations can also have an infinite size. For example, infinite populations are used for all possible results of a sequence of trials, such as flipping a coin.

Common Properties for All Forms of the Normal Distribution

Despite the different shapes, all forms of the normal distribution have the following characteristic properties.

  • They’re all symmetric. The normal distribution cannot model skewed distributions.

  • The mean, median, and mode are all equal.

  • Half of the population is less than the mean and half is greater than the mean.

  • The Empirical Rule allows you to determine the proportion of values that fall within certain distances from the mean. More on this below!

The Empirical Rule for the Normal Distribution

When you have normally distributed data, the standard deviation becomes particularly valuable. You can use it to determine the proportion of the values that fall within a specified number of standard deviations from the mean. For example, in a normal distribution, 68% of the observations fall within +/- 1 standard deviation from the mean. This property is part of the Empirical Rule, which describes the percentage of the data that fall within specific numbers of standard deviations from the mean for bell-shaped curves.

Let’s look at a pizza delivery example. Assume that a pizza restaurant has a mean delivery time of 30 minutes and a standard deviation of 5 minutes. Using the Empirical Rule, we can determine that 68% of the delivery times are between 25-35 minutes (30 +/- 5), 95% are between 20-40 minutes (30 +/- 10), and 99.7% are between 15-45 minutes (30 +/- 15).
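
We can verify these percentages with a quick scipy sketch:

import scipy.stats as stats

delivery = stats.norm(loc=30, scale=5)  # mean 30 minutes, standard deviation 5 minutes

for k in (1, 2, 3):
    p = delivery.cdf(30 + k * 5) - delivery.cdf(30 - k * 5)
    print(f"within +/- {k} standard deviations: {p:.3f}")  # ~0.683, ~0.954, ~0.997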

Standard Normal Distribution and Standard Scores

As we’ve seen above, the normal distribution has many different shapes depending on the parameter values. However, the standard normal distribution is a special case of the normal distribution where the mean is zero and the standard deviation is 1. This distribution is also known as the Z-distribution.

A value on the standard normal distribution is known as a standard score or a Z-score. A standard score represents the number of standard deviations above or below the mean that a specific observation falls. For example, a standard score of 1.5 indicates that the observation is 1.5 standard deviations above the mean. On the other hand, a negative score represents a value below the average. The mean has a Z-score of 0.

Suppose you weigh an apple and it weighs 110 grams. There’s no way to tell from the weight alone how this apple compares to other apples. However, as you’ll see, after you calculate its Z-score, you know where it falls relative to other apples.

Standardization: How to Calculate Z-scores

Standard scores are a great way to understand where a specific observation falls relative to the entire distribution. They also allow you to take observations drawn from normally distributed populations that have different means and standard deviations and place them on a standard scale. This standard scale enables you to compare observations that would otherwise be difficult.

This process is called standardization, and it allows you to compare observations and calculate probabilities across different populations. In other words, it permits you to compare apples to oranges. Isn’t statistics great!

To standardize your data, you need to convert the raw measurements into Z-scores.

To calculate the standard score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically:

Z = (X - μ) / σ

X represents the raw value of the measurement of interest, while μ (mu) and σ (sigma) represent the parameters of the population from which the observation was drawn.

After you standardize your data, you can place them within the standard normal distribution. In this manner, standardization allows you to compare different types of observations based on where each observation falls within its own distribution.
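
A tiny sketch of the standardization step (the numbers are only illustrative and happen to match the apple example below):

def z_score(x, mu, sigma):
    """Standardize a raw measurement against its population parameters."""
    return (x - mu) / sigma

print(z_score(110, 100, 15))  # ~0.667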

Example of Using Standard Scores to Make an Apples to Oranges Comparison

Suppose we literally want to compare apples to oranges. Specifically, let’s compare their weights. Imagine that we have an apple that weighs 110 grams and an orange that weighs 100 grams.

If we compare the raw values, it’s easy to see that the apple weighs more than the orange. However, let’s compare their standard scores. To do this, we’ll need to know the properties of the weight distributions for apples and oranges. Assume that apple weights follow a normal distribution with a mean of 100 grams and a standard deviation of 15 grams, and that orange weights follow a normal distribution with a mean of 140 grams and a standard deviation of 25 grams.

Now we’ll calculate the Z-scores:

  • Apple = (110-100) / 15 = 0.667

  • Orange = (100-140) / 25 = -1.6

The Z-score for the apple (0.667) is positive, which means that our apple weighs more than the average apple. It’s not an extreme value by any means, but it is above average for apples. On the other hand, the orange has a fairly negative Z-score (-1.6); it’s pretty far below the mean weight for oranges.

While our apple weighs more than our orange, we are comparing a somewhat heavier than average apple to a downright puny orange! Using Z-scores, we’ve learned how each fruit fits within its own distribution and how they compare to each other.

Finding Areas Under the Curve of a Normal Distribution

The normal distribution is a probability distribution. As with any probability distribution, the proportion of the area that falls under the curve between two points on a probability distribution plot indicates the probability that a value will fall within that interval. To learn more about this property, see the Statistics By Jim article linked in the references.

Typically, statistical software is used to find areas under the curve. However, when you’re working with the normal distribution and convert values to standard scores, you can calculate areas by looking up Z-scores in a standard normal distribution table.

Because there are an infinite number of different normal distributions, publishers can’t print a table for each distribution. However, you can transform the values from any normal distribution into Z-scores, and then use a table of standard scores to calculate probabilities.

Using a Table of Z-scores

Let’s take the Z-score for our apple (0.667) and use it to determine its weight percentile. A percentile is the proportion of a population that falls below a specific value. Consequently, to determine the percentile, we need to find the area that corresponds to the range of Z-scores less than 0.667. In a standard normal table, the closest Z-score to ours is 0.65, which we’ll use.

The trick with these tables is to use the values in conjunction with the properties of the normal distribution to calculate the probability that you need. The table value indicates that the area of the curve between -0.65 and +0.65 is 48.43%. However, that’s not what we want to know. We want the area that is less than a Z-score of 0.65.

We know that the two halves of the normal distribution are mirror images of each other. So, if the area for the interval from -0.65 and +0.65 is 48.43%, then the range from 0 to +0.65 must be half of that: 48.43/2 = 24.215%. Additionally, we know that the area for all scores less than zero is half (50%) of the distribution.

Therefore, the area for all scores up to 0.65 = 50% + 24.215% = 74.215%

Our apple is at approximately the 74th percentile.

A probability distribution plot produced by statistical software shows the same percentile along with a graphical representation of the corresponding area under the curve. The value is slightly different because we used a Z-score of 0.65 from the table, while the software uses the more precise value of 0.667.
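
The software lookup amounts to a one-liner with scipy:

import scipy.stats as stats

# Area to the left of the apple's Z-score, i.e. its percentile
print(stats.norm.cdf(0.667))  # ~0.748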

Hypothesis Testing

Gathering data in itself is meaningless unless we can analyze it and draw powerful insights. What makes data interesting is the ability to evaluate and interpret it.

Hypothesis testing is a statistical technique in which we, as analysts, evaluate an assumption about a parameter of a data set or population.

Based on the purpose of the analysis and the specific characteristics of the data, we can use different methodologies. In general, the technique gives us a standardized way to assess the plausibility of an assumption based on sample data.

This sample data can originate either from a larger population or a data-generating process.

What is a Hypothesis

Essentially, it is an educated guess, which we can test with observations or by experimenting. It can be anything, so long as it is testable.

When we propose a hypothesis, we write a hypothesis statement.

Generally, we strive to keep this in the form of ‘If… then…’. More specifically,

‘If A happens to an independent variable, then B will happen to the dependent variable.’

There are some characteristics to a well-written statement:

  • As mentioned, it’s an if-then statement;

  • We can test the statement scientifically;

  • It states both the independent and dependent variables.

First, we define the problem we are analyzing, and then we base our hypothesis statement on this problem.

It is crucial to remember that the underlying assumption can be about any parameter of the population, and it can either be true or not.

What is Hypothesis Testing

The best way to evaluate a hypothesis would be to review the entire population of the data we are analyzing. However, this usually proves to be highly impractical, if not wholly impossible. Therefore, we typically assess only a randomly selected sample instead of the entire population.

And if the data within the sample is not consistent with our hypothesis, we can reject it.

When we perform statistical analysis, we test a hypothesis by evaluating a random sample of the entire population. Practically, we test two hypotheses:

  • The null hypothesis (H0)

  • The alternative hypothesis (HA)

The null hypothesis usually assumes equality in a parameter of the population, for example that the mean of the population is equal to zero. The alternative hypothesis is then the exact opposite: the mean does not equal zero.

The null and alternative hypotheses have to be mutually exclusive: only one of them can be correct, and one of the two is always true.

The null hypothesis is usually the accepted fact — the mean equals zero, smoking causes cancer, loud music hurts your ears, and others. When we look at the randomly selected sample, we usually consider that the null hypothesis is that the observations are simply the result of chance. The alternative view is then that they are affected by a non-random cause.

The Four Steps of Hypothesis Testing

We can present the process of data-driven decision making in four steps:

  1. State the two hypotheses (null and alternative) in a way that only one can be true;

  2. Plan how to evaluate the data and prepare the analysis plan, outlining how we will use the sample to assess the population. It is common to focus on a single population parameter (e.g., the mean or standard deviation) together with a test statistic or p-value;

  3. Evaluate the sample data and calculate the value of the test statistics, as described in the analysis plan;

  4. Assess the results by applying the decision rules from the plan. Here we either accept the null hypothesis as plausible or reject it in favor of the alternative hypothesis.

Decision Rules

One of the most important things we need to define in our analysis plan is the set of decision rules for rejecting the null hypothesis, to be used in our assessment. In practice, these can be specified in two ways: by referring to either a p-value or a region of acceptance.

A p-value measures the strength of the evidence against the null hypothesis. It is the probability of observing a test statistic at least as extreme as the one we computed, under the assumption that the null hypothesis is true. If the p-value is less than the significance level (our threshold), we reject the null hypothesis.

An acceptance region is a set of values for the test statistic. If it falls within those values, we fail to reject the null hypothesis. And values outside the region of acceptance fall within the region of rejection. If our test statistic ends up here, we reject the null. We can then say that we reject the null hypothesis at the α level of significance.
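
As a small sketch of both decision rules for a two-sided Z test at α = 0.05 (the test statistic value here is hypothetical):

import scipy.stats as stats

alpha = 0.05
z_stat = 2.1  # hypothetical test statistic

# Rule 1: p-value versus significance level
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
print(p_value < alpha)  # True -> reject the null hypothesis

# Rule 2: region of acceptance versus region of rejection
critical = stats.norm.ppf(1 - alpha / 2)  # ~1.96
print(abs(z_stat) > critical)  # True -> reject the null hypothesis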

Accepting or Failing to Reject

The testing has one of two outcomes – we accept the null hypothesis, or we reject it. However, most statisticians prefer to say they reject the null or fail to reject it instead of accepting it.

The idea behind this is that saying we accept the null hypothesis means we deem it to be true, while saying we fail to reject it means we did not find the data persuasive enough to select the alternative over the null. Because we are performing a probabilistic test, there’s always a small chance of being wrong, and this wording covers that.

Errors

When we evaluate a hypothesis, we can end up with one of two types of errors:

Type I

This is when we reject the null hypothesis, but it is in fact true. The probability of making a Type I error is the significance level, also called alpha (denoted α). In financial modeling and analysis, we would usually set alpha at 5% or 0.05. A smaller alpha (like 1% or 0.1%) demands stronger evidence before rejecting the null hypothesis.

Type II

We make this error when we fail to reject the null hypothesis, but it is in fact false. The probability of making a Type II error is called beta, denoted β. The chance of not making such an error (1 − β) is called the power of the test.

Interpreting the Results

We evaluate the p-value to portray a finding as statistically significant by comparing the value of the statistical test to the predefined alpha level. If the p-value is less than the predefined threshold, the result is statistically significant.

From the perspective of hypothesis testing, if the p-value is less than (or equal to) the alpha, we reject the null hypothesis (significant result). If the p-value is higher than the alpha, we fail to reject the null (insignificant result).

The confidence level of the hypothesis for the observed data can be calculated as one minus alpha (1 – α). Knowing this, we have two ways to write up our conclusions.

  • Fail to reject the null hypothesis at a 5% significance level; or

  • Fail to reject the null hypothesis at a 95% confidence level.

When we interpret the p-value, it does not mean the null is true or false. It only means we have chosen to reject (or fail to reject) the null hypothesis at a specific confidence level based on the sample observations of the data. We cannot make binary decisions as we only rely on a probabilistic approach.

Critical Values

Instead of p-values, some tests may return a list of critical values, with their respective significance levels, and also a test statistic. We usually get such results in distribution-free hypothesis testing. However, the choice between p-value and critical values happens as part of the initial test design.

We similarly assess them by comparing the test statistic to the critical value at a chosen significance level. If the test statistic is higher than (or equal to) the critical value, we reject the null hypothesis. And conversely, if the test statistic is less than the critical value, we fail to reject the null.

We present the results in the same way as with p-values.

Parametric Tests

Parametric tests are used for the following cases:

  1. Quantitative Data

  2. Continuous variable

  3. When data are measured on an interval or ratio scale of measurement.

  4. When the data follow a normal distribution.

Types Of Parametric Tests

  1. t-test (n < 30), which is further classified into 1-sample and 2-sample tests

  2. ANOVA (Analysis of Variance): one-way ANOVA, two-way ANOVA

  3. Pearson’s r correlation

  4. Z-test for large samples (n > 30)

Student’s T-Test

This test was developed by Prof. W. S. Gosset in 1908, who published statistical papers under the pen name ‘Student’. Thus the test is known as Student’s t-test.

Indications for the test:

  1. When samples are small

  2. Population Variances are not known.

Uses:

  1. Compare two means of small independent samples

  2. Compare Sample mean and Population mean

  3. Compare two proportions of small independent samples.

Assumptions:

  1. Samples are randomly selected

  2. Data utilised are quantitative

  3. The variable follows a normal distribution

  4. Sample variances are approximately the same in both groups under study

  5. Samples are small, mostly lower than 30.

A t-test compares the difference between two means of different groups to determine whether the difference is statistically significant.
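
A minimal two-sample t-test sketch with scipy (the samples are randomly generated just to show the call):

import numpy as np
import scipy.stats as stats

group_a = np.random.normal(10.0, 2.0, size=20)  # small samples (n < 30)
group_b = np.random.normal(11.0, 2.0, size=20)

# Independent two-sample t-test; equal_var=True reflects the equal-variance assumption
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t_stat, p_value)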

Analysis of Variance (ANOVA)

  • Developed by Sir Ronald Fisher

  • The principal aim of statistical models is to explain the variation in measurements.

  • The statistical model for testing the significance of the difference in mean values of a variable between two groups is Student’s t-test. If there are more than two groups, the appropriate statistical model is Analysis of Variance (ANOVA).

Assumptions:

  1. The sample populations can be reasonably approximated by a normal distribution.

  2. All populations have the same standard deviation.

  3. Individuals in the populations are selected randomly.

  4. The samples are independent.

  • ANOVA compares variances by means of a simple ratio, called the F-ratio: F = (variance between groups) / (variance within groups).

  • The resulting F statistic is then compared with the critical value of F obtained from F tables, in much the same way as was done with ‘t’.

  • If the calculated value exceeds the critical value for the appropriate level of α, the null hypothesis is rejected.

An F test is therefore a test of the ratio of variances. F tests can also be used on their own, independently of the ANOVA technique, to test hypotheses about variances.

In ANOVA, the F test is used to establish whether a statistically significant difference exists in the data being tested.

ANOVA is further divided into:

  • One way ANOVA

  • Two way ANOVA

One Way ANOVA

If the various experimental groups differ in terms of only one factor at a time, a one-way ANOVA is used, e.g. a study to assess the effectiveness of four different antibiotics on S. sanguis.

Two Way ANOVA

If the various groups differ in terms of two or more factors at a time, then a two-way ANOVA is performed, e.g. a study to assess the effectiveness of four different antibiotics on S. sanguis in three different age groups.
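
A one-way ANOVA sketch with scipy (four hypothetical antibiotic groups, mirroring the example above):

import numpy as np
import scipy.stats as stats

# Hypothetical measurements for four antibiotic groups
a = np.random.normal(5.0, 1.0, 15)
b = np.random.normal(5.5, 1.0, 15)
c = np.random.normal(6.0, 1.0, 15)
d = np.random.normal(5.2, 1.0, 15)

f_stat, p_value = stats.f_oneway(a, b, c, d)  # F = between-group / within-group variance
print(f_stat, p_value)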

Pearson’s Correlation Coefficient

  • Correlation is a technique for investigating the relationship between two quantitative, continuous variables

  • Pearson’s correlation coefficient (r) is a measure of the strength of the association between the two variables.

Assumptions:

  1. Subjects selected for the study, with paired values of X and Y, are chosen by a random sampling procedure.

  2. Both X and Y variables are continuous.

  3. Both X and Y variables are assumed to follow a normal distribution.

Steps:

  • The first step in studying the relationship between two continuous variables is to draw a scatter plot of the variables to check for linearity.

  • The correlation coefficient should not be calculated if the relationship is not linear.

  • For correlation purposes alone, it does not matter on which axis the variables are plotted.

However, conventionally, the independent variable is plotted on the X-axis and the dependent variable on the Y-axis.

The nearer the scatter of points is to a straight line, the higher the strength of association between the variables.

Types of Correlation:

Perfect positive correlation: r = +1

Partial positive correlation: 0 < r < +1

Perfect negative correlation: r = -1

Partial negative correlation: -1 < r < 0

No correlation: r = 0
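
A short sketch with scipy (x and y are illustrative paired measurements):

import numpy as np
import scipy.stats as stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p_value = stats.pearsonr(x, y)  # r near +1 indicates a strong positive association
print(r, p_value)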

Z-Test

This test is used for testing the significance of the difference between two means when samples are large (n > 30).

Assumptions:

  1. The sample must be randomly selected.

  2. Data must be quantitative.

  3. Samples should be larger than 30.

  4. Data should follow normal distribution.

  5. Sample variances should be almost the same in both the groups of study.

If the SD of the populations is known, a Z test can be applied even if the sample is smaller than 30.

Indications:

  • To compare sample mean with population mean.

  • To compare two sample means.

  • To compare sample proportion with population proportion.

  • To compare two sample proportions.

Steps:

  1. Defining the problem

  2. Stating the null hypothesis (H0) against the alternate hypothesis (H1)

  3. Finding the Z value: Z = (observed difference) / (standard error of the difference)

  4. Fixing the level of significance

  5. Comparing the calculated Z value with the value in the Z table at the chosen significance level.

If the observed Z value is greater than the theoretical (critical) Z value, the result is significant: we reject the null hypothesis and accept the alternative hypothesis.
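
A sketch of a one-sample Z test comparing a sample mean with a population mean, computed directly from the formula (all numbers are hypothetical and the population SD is assumed known):

import numpy as np
import scipy.stats as stats

sample = np.random.normal(52.0, 10.0, size=100)  # hypothetical large sample (n > 30)
mu0, sigma = 50.0, 10.0                          # hypothesized mean, known population SD

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))       # two-tailed
print(z, p_value)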

Z-Proportionality Test

Used for testing the significant difference between two proportions.

One-tailed and Two-tailed Z Tests

  • Z values on each side of the mean are calculated as +Z or -Z.

  • A result above the mean gives a positive Z value, and a result below the mean gives a negative Z value.

Example of a two-tailed test:

In a test of significance, when we want to determine whether the mean IQ of malnourished children is different from that of well-nourished children, without specifying higher or lower, the P value covers extreme results at both ends of the scale, and the test is called a two-tailed test.

Example of a one-tailed test:

In a test of significance, when we want to know specifically whether a result is larger or smaller than what would occur by chance, the significance level or P value applies to one end of the distribution only. For example, if we want to know whether malnourished children have a lower mean IQ than well-nourished children, the result will lie at one end (tail) of the distribution, and the test is called a one-tailed test.

Conclusion

  • Tests of significance play an important role in conveying the results of any research, and thus the choice of an appropriate statistical test is very important, as it decides the fate of the outcome of the study.

  • Hence the emphasis placed on tests of significance in clinical research must be tempered with an understanding that they are tools for analyzing data and should never be used as a substitute for knowledgeable interpretation of outcomes.

Non-Parametric Tests in Hypothesis Testing

Z-test, Student’s t-test, paired t-test, ANOVA, MANOVA... they all belong to the parametric statistics family, which assumes that sample data come from a population that can be adequately modeled by a probability distribution with a fixed set of parameters (mean, standard deviation), typically the normal distribution.

Parametric tests usually assume three things:

  • Independence of cases: samples are independent observations

  • Normality: sample data come from a normal distribution (or at least are symmetric)

  • Homogeneity of variances: sample data come from a population with the same variance

However, in real life, these assumptions can hardly be met. Non-parametric tests have much more relaxed assumptions: they are either distribution-free or assume a specified distribution whose parameters are left unspecified.

When do we use Non-Parametric Tests?

Imagine we were comparing the average quantity of a condiment that consumers buy when there is a discount versus no discount. Suppose the two sample distributions do not look normal. Does that mean we cannot use parametric tests? Not necessarily.

Don’t forget Central Limit Theorem! A sample that looks non-symmetric does not necessarily mean the population is not normally distributed.

Central Limit Theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough. How large is large enough? Usually, it is safe to have a sample size of at least 30 when the population distribution is roughly bell-shaped.
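
A quick simulation sketch of this (the skewed exponential population is only an illustration):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # clearly non-normal population

# Sampling distribution of the mean for samples of size 30
sample_means = [rng.choice(population, size=30).mean() for _ in range(2000)]
print(np.mean(sample_means), np.std(sample_means))  # roughly bell-shaped, centered near 2.0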

Test for Normality

If the sample size is small (less than 30), the first step is always to test the normality of the population. Kolmogorov-Smirnov Test (KS Test) can be used for that! The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples.

The null hypothesis of the KS test is that the sample is drawn from the reference distribution. In Python, it is very easy with Scipy library.

import scipy.stats as stats
# Pass the sample's estimated parameters; 'norm' alone tests against the standard normal N(0, 1)
t, pvalue = stats.kstest(sample, 'norm', args=(sample.mean(), sample.std()))

Let’s say your alpha level is 0.05. If the p-value is larger than 0.05, you cannot reject the null hypothesis, so the data are consistent with a normal distribution at the 5% significance level. Note that failing to reject the null is not proof that the sample really is normally distributed.

Test for Equality of Variance

If your sample is large enough, it is actually more important to test for equality of variances (homoscedasticity). Levene’s test is designed for that. Levene’s test can be used to assess the equality of variances of a variable computed for two or more groups. The null hypothesis of Levene’s test is that the samples are drawn from populations with the same variance. If a significant result is observed, one should switch to tests like Welch’s t-test or other non-parametric tests.

The Python code is below:

import scipy.stats as stats
# Pass each group's sample as a separate positional argument
t, pvalue = stats.levene(sample1, sample2, center='mean')

Note that ‘center’ can be ‘mean’, ‘median’, or ‘trimmed’; ‘median’ is the default and is more robust for skewed distributions.

Commonly used Non-parametric Tests

Kolmogorov-Smirnov Test (KS Test for 2 samples)

This is a two-sided test for the null hypothesis that 2 independent samples are drawn from the same continuous distribution. If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.

import scipy.stats as stats
# Two-sample KS test: compares the empirical distributions of the two samples
t, pvalue = stats.ks_2samp(sample1, sample2)

Mann-Whitney U Test (Nonparametric version of 2-sample t test)

Mann-Whitney U test is commonly used to compare differences between two independent groups when the dependent variable is not normally distributed. It is often considered the nonparametric alternative to the independent t-test. The null hypothesis of Mann-Whitney U is that two independent samples were selected from populations that have the same distribution.

import scipy.stats as stats
# Use a two-sided alternative unless you have a directional hypothesis
t, pvalue = stats.mannwhitneyu(sample1, sample2, alternative='two-sided')

Kruskal-Wallis H Test (KW Test — Nonparametric version of one-way ANOVA)

The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. It is a non-parametric version of ANOVA. A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates at least one other sample. The test does not identify where this stochastic dominance occurs or for how many pairs of groups it holds. Therefore, post-hoc comparisons between groups are required to determine which groups are different.

import scipy.stats as stats
# Pass each group's sample as a separate positional argument
t, pvalue = stats.kruskal(sample1, sample2, sample3)

As a recap, before selecting any test method, always make sure you have a solid and clear hypothesis statement that specifies whether you are testing the mean or the distribution.

Then make sure your sample data and the population meet the assumptions of the test you are going to do; normality and equality of variances are the main things to consider. If you end up with non-parametric tests, have fun and interpret the results correctly!

References:

https://towardsdatascience.com/data-types-in-statistics-347e152e8bee

https://statisticsbyjim.com/basics/normal-distribution/

https://medium.com/magnimetrics/hypothesis-testing-for-complete-beginners-7045c87efefa

https://medium.com/the-owl/parametric-tests-d4f5c26ddf13

https://towardsdatascience.com/non-parametric-tests-in-hypothesis-testing-138d585c3548
