Copy of Week 16
In Data Science, we run different experiments on raw data and try to find useful insights in it. To drive any business in the right direction, data is very important; as the saying goes, “Data is the fuel”. It can provide actionable insights that help to:
•Strategize current campaigns,
•Easily organize the launch of new products, or
•Try out different experiments.
In all of the above, the one common driving component is data. We are entering a digital era in which we produce a lot of data every day.
For example, a company like Flipkart produces more than 2- TB of data on a daily basis.
Because data is so important in our lives, it is crucial to store and process it properly and without errors. When dealing with datasets, the data type or category of the data plays an important role in answering questions such as:
•Which preprocessing strategy would work for a particular set to get the right results, or
•Which type of statistical analysis should be applied for the best results.
In this article, we will discuss the different data types in statistics that you need to know to do proper Exploratory Data Analysis (EDA), which is one of the most important components in the pipeline of a Machine Learning project.
1. Introduction to Data Types in Statistics and their Importance
2. Qualitative vs Quantitative Data
3. Qualitative Data
•Nominal Data
•Ordinal Data
4. Quantitative Data
•Discrete Data
•Continuous Data
•Interval Data
•Ratio Data
In statistics, data types play a crucial role: we need to understand them to apply statistical measurements correctly to our data and to draw correct conclusions about it.
Similarly, we need to know which type of data we are working with to select the correct analysis technique.
While doing Exploratory Data Analysis (EDA) in a typical data science project, a good understanding of the different data types is crucial, since certain statistical measurements can be used only for specific data types.
The data type of a variable is also known as its measurement scale.
When dealing with any data type, we also need to know which visualization method fits it. We can think of data types as a way to categorize different types of variables.
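The idea that only certain statistical measurements fit certain data types can be sketched as a small lookup. This is an illustrative summary under the usual textbook convention that each measurement scale also permits everything allowed by the scales before it; the names `SCALES`, `NEW_STATISTICS`, and `allowed_statistics` are made up for this example.

```python
# Measurement scales, ordered from least to most informative.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that become meaningful at each scale (assumed textbook
# convention; each scale also inherits the statistics of earlier scales).
NEW_STATISTICS = {
    "nominal": ["counts", "mode"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation"],
    "ratio": ["ratios of values"],
}


def allowed_statistics(scale):
    """Return every statistic that is meaningful for the given scale."""
    position = SCALES.index(scale)
    allowed = []
    for name in SCALES[: position + 1]:
        allowed.extend(NEW_STATISTICS[name])
    return allowed


print(allowed_statistics("ordinal"))
# ['counts', 'mode', 'median', 'percentiles']
```

For instance, the mean does not appear in the result for "ordinal" but does appear for "interval" and "ratio".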
Quantitative Data
1. This type of data is the easiest to explain. It answers questions such as
•“how many”,
•“how much”, and
•“how often”.
2. It can be expressed as a number, so it can be quantified. In simple words, it can be measured by numerical variables.
3. It lends itself to statistical manipulation and can be represented by a wide variety of graphs and charts, such as line charts, bar graphs, and scatter plots.
Examples of quantitative data:
•Scores of tests and exams e.g. 74, 67, 98, etc.
•The weight of a person.
•The temperature in a room.
There are 2 general types of quantitative data:
•Discrete data
•Continuous data
Qualitative Data
Qualitative data can’t be expressed as a number, so it can’t be measured. It mainly consists of words, pictures, and symbols, but not numbers.
It is also known as Categorical Data, as the information can be sorted by category, not by number.
It can answer questions like:
•“how this has happened”, or
•“why this has happened”.
Examples of qualitative data:
•Colors e.g. the color of the sea
•Popular holiday destinations such as Switzerland, New Zealand, South Africa, etc.
•Ethnicity such as American Indian, Asian, etc.
In general, there are 2 types of qualitative data:
•Nominal data
•Ordinal data.
Nominal Data
Nominal data is used just for labeling variables, without any quantitative value. The term ‘nominal’ comes from the Latin word “nomen”, which means ‘name’.
It just names a thing without implying any particular order. Nominal data is sometimes referred to as “labels”.
Examples of Nominal Data:
•Gender (Women, Men)
•Hair color (Blonde, Brown, Brunette, Red, etc.)
•Marital status (Married, Single, Widowed)
As you can observe from the examples, there is no intrinsic ordering to the values.
For instance, eye color is a nominal variable with a few levels or categories such as Blue, Green, and Brown, and there is no way to rank these categories from highest to lowest or vice versa.
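Because nominal values are only labels, the meaningful summaries are counts and the most frequent label. A minimal sketch using Python's standard library, with a made-up sample of hair colors (the data is hypothetical, not from the article):

```python
from collections import Counter
from statistics import mode

# Hypothetical sample of a nominal variable (hair color).
hair_colors = ["Brown", "Blonde", "Brown", "Red", "Brown", "Blonde"]

# Counting labels and finding the most frequent one are meaningful.
counts = Counter(hair_colors)
most_common_color = mode(hair_colors)

print(counts["Brown"], counts["Blonde"], counts["Red"])  # 3 2 1
print(most_common_color)                                 # Brown
```

Sorting these labels would only give alphabetical order, which says nothing about the hair colors themselves.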
Ordinal Data
The crucial difference from nominal data is that ordinal data shows where a value sits in a particular order.
This type of data is placed into some kind of order by its position on a scale. Ordinal data may indicate superiority.
We cannot do arithmetic with ordinal data, because it only shows the sequence.
Ordinal variables are considered to be “in between” qualitative and quantitative variables.
In simple words, ordinal data is qualitative data for which the values are ordered.
By comparison, nominal data is qualitative data for which the values cannot be placed in an order.
Based on relative position, we can also assign numbers to ordinal data, such as “first, second, third”, but we cannot do math with those numbers.
Examples of Ordinal Data:
•Ranking of users in a competition: first, second, third, etc.
•Rating of a product collected by the company on a scale of 1-10.
•Economic status: low, medium, and high.
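The economic status example can be sketched in code to show what ordering does and does not allow. This is a minimal illustration with hypothetical responses; `ORDER` and `RANK` are names invented here. The ranks are used only for sorting and for picking the middle observation, never for arithmetic.

```python
from statistics import median_low

# Hypothetical ordinal variable: economic status with an explicit order.
ORDER = ["low", "medium", "high"]
RANK = {level: position for position, level in enumerate(ORDER)}

responses = ["medium", "low", "high", "medium", "low", "medium", "high"]

# Sorting by rank respects the order of the categories.
sorted_responses = sorted(responses, key=RANK.get)

# median_low always returns one of the actual data points, so no averaging
# of ranks takes place; the rank is then mapped back to its label.
median_level = ORDER[median_low([RANK[r] for r in responses])]

print(sorted_responses)
print(median_level)  # medium
```

A mean of the ranks is deliberately not computed, since the distance between "low" and "medium" is not known to equal the distance between "medium" and "high".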
Discrete Data
1. Discrete data represents counts that involve only integers; discrete values cannot be subdivided into parts.
For example, the number of students in a class is discrete data, since we can count whole individuals but cannot have 2.5 or 3.75 students.
2. In simple words, discrete data can take only certain values, and those values cannot be divided into smaller parts.
3. It has a countable, often limited, number of possible values, e.g. the days of the month.
Examples of discrete data:
•The number of students in a class.
•The number of workers in a company.
•The number of test questions you answered correctly.
Continuous Data
Continuous data represents information that can be meaningfully divided into finer levels. It can be measured on a scale or continuum and can have almost any numeric value. For example, we can measure height at very precise scales in different units such as meters, centimeters, and millimeters.
The key difference between continuous and discrete data is that continuous data can be recorded at arbitrarily fine levels of measurement, as with width, temperature, or time.
Continuous variables can take any value between two numbers. For example, between 60 and 82 inches there are infinitely many possible heights, such as 62.04762 inches or 79.948376 inches.
A good rule of thumb for deciding whether data is continuous or discrete is that if the unit of measurement can be halved and still make sense, the data is continuous.
Examples of continuous data:
•The amount of time required to complete a project.
•The height of children.
•The speed of cars.
Interval Data
Interval data is measured and ordered, with equal distances between adjacent items, but has no meaningful zero. Let’s understand the meaning of “interval scale”: the term ‘interval’ signifies the space in between, which is important to remember, because interval scales tell us not only about the order but also about the size of the difference between items.
Fundamentally, we can present interval data in the same way as ratio data, but we have to keep in mind that its zero point is arbitrarily defined.
Hence, with interval data, we can easily compare the magnitudes of the values and also add or subtract them.
There are some descriptive statistics that we can calculate for interval data, such as:
•Measures of central tendency (mean, median, mode)
•Range (minimum, maximum)
•Spread (percentiles, interquartile range, and standard deviation).
These are not the only statistics that can be calculated; others are possible as well.
Examples of Interval Data:
•Temperature (°C or °F, but not Kelvin)
•Dates (1055, 1297, 1976, etc.)
•Time of day on a 12-hour clock (6 am, 6 pm)
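The temperature example shows why an arbitrary zero matters. In the sketch below (the two temperatures are made up for illustration), the gap between two readings is the same physical gap whichever scale is used, but the ratio between them changes with the scale, so a statement like "twice as hot" is not meaningful for °C or °F.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


def celsius_to_kelvin(celsius):
    return celsius + 273.15


# Two hypothetical temperature readings in degrees Celsius.
cool_c, warm_c = 10.0, 20.0

# Differences are meaningful: the 10 degree gap in Celsius is the same
# physical gap when both readings are expressed in Kelvin.
gap_c = warm_c - cool_c
gap_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not meaningful on an interval scale: the result depends on
# where the scale happens to place its zero.
ratio_c = warm_c / cool_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(gap_c, round(gap_k, 2))                         # 10.0 10.0
print(ratio_c, round(ratio_f, 2), round(ratio_k, 3))  # 2.0 1.36 1.035
```

Only the Kelvin ratio has a physical interpretation, because Kelvin has an absolute zero.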
Ratio Data
Ratio data is also measured in ordered units that have the same difference between them.
Ratio values are the same as interval values, with one difference: ratio data has an absolute zero. Examples include height, weight, and length.
Ratio data is measured and ordered with equidistant items and a meaningful zero and, unlike interval data, can never be negative. Let’s understand this with a simple example: the measurement of height. Height can be measured in units like centimeters, inches, meters, or feet, and it is not possible to have a negative height.
The ratio scale tells us about the order of the values and the differences between them, and it has an absolute zero.
In short, ratio data is fundamentally the same as interval data, except that zero means none.
The descriptive statistics that we can calculate for ratio data are the same as for interval data:
•Measures of central tendency (mean, median, mode)
•Range (minimum, maximum)
•Spread (percentiles, interquartile range, and standard deviation).
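The statistics listed above can all be computed with Python's standard `statistics` module. The sketch below uses a hypothetical list of heights in centimeters. Note that `statistics.quantiles` uses its default "exclusive" method here, and other tools may compute quartiles slightly differently.

```python
import statistics

# Hypothetical ratio-scale data: heights in centimeters.
heights_cm = [150, 160, 160, 165, 170, 175, 180]

# Measures of central tendency.
mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
mode_height = statistics.mode(heights_cm)

# Range, reported as (minimum, maximum).
value_range = (min(heights_cm), max(heights_cm))

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)
iqr = q3 - q1

# Sample standard deviation (divides by n - 1).
std_dev = statistics.stdev(heights_cm)

print(mean_height, median_height, mode_height)
print(value_range, iqr, std_dev)
```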
Examples of Ratio Data:
•Age (from 0 years to 100+)
•Temperature (in Kelvin, but not °C or °F)
•Time interval (measured with a stop-watch or similar)
In the above examples of ratio data, there is an actual and meaningful zero point: the age of a person, absolute zero, and a distance or time measured from a specified point all have real zeros.
NOTE:
If the zero point of the scale was picked subjectively, the data can’t be ratio data and should be treated as interval data.
Statistics is a branch of mathematics that deals with collecting, analyzing, organizing, and interpreting data.
When we first get the data, instead of applying fancy algorithms and making predictions, we first try to read and understand the data by applying statistical techniques. By doing this, we are able to understand what type of distribution the data has.
This blog aims to answer the following questions:
What is Descriptive Statistics?
Types of Descriptive Statistics?
Measure of Central Tendency (Mean, Median, Mode)
Measure of Spread / Dispersion (Standard Deviation, Mean Deviation, Variance, Percentile, Quartiles, Interquartile Range)
What is Skewness?
What is Kurtosis?
What is Correlation?
Today, let’s understand descriptive statistics once and for all. Let’s start.
Descriptive statistics involves summarizing and organizing the data so that it can be easily understood. Unlike inferential statistics, descriptive statistics seeks to describe the data but does not attempt to make inferences from the sample to the whole population. Here, we typically describe the data in a sample, which generally means that descriptive statistics is not developed on the basis of probability theory.
Descriptive statistics are broken down into two categories: measures of central tendency and measures of variability (spread).
Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way “central” to the set.
The mean, or average, is a measure of the central tendency of the data, i.e. a number around which the whole data set is spread out. In a way, it is a single number that can estimate the value of the whole data set. Let’s calculate the mean of the data set having 8 integers.
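As a minimal sketch of the calculation: the mean is the sum of all values divided by the number of values. The eight integers below are hypothetical, chosen only for illustration, and are not the article's original data set.

```python
# Hypothetical data set of 8 integers (illustrative values only).
values = [12, 23, 34, 51, 67, 72, 89, 92]

# Mean = sum of all values divided by the number of values.
mean = sum(values) / len(values)

print(mean)  # 55.0
```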
The median is the value that divides the data into 2 equal parts, i.e. the number of terms on its right side is the same as the number of terms on its left side when the data is arranged in either ascending or descending order.
Note: Sorting the data in descending order won’t affect the median, but the IQR will come out negative. We will talk about IQR later in this blog.
•The median is the middle term if the number of terms is odd.
•The median is the average of the middle 2 terms if the number of terms is even.
The median is 59, which divides the set of numbers into two equal parts. Since the set has an even number of values, the median is the average of the two middle numbers, 51 and 67.
Note: When the values are in arithmetic progression (the difference between consecutive terms is constant; here it is 2), the median is always equal to the mean.
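The odd and even rules for the median, and the note about arithmetic progressions, can be sketched as follows. The data sets are hypothetical: the even-length set is made up so that its two middle values are 51 and 67, and `manual_median` is a helper written for this illustration.

```python
from statistics import mean, median


def manual_median(values):
    """Median by the odd/even rule: middle term, or mean of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical sets; the even one has 51 and 67 as its middle two values.
even_set = [12, 23, 34, 51, 67, 72, 89, 92]
odd_set = [12, 23, 34, 51, 67]

print(manual_median(even_set))  # 59.0
print(manual_median(odd_set))   # 34

# Arithmetic progression with a common difference of 2:
# the mean and the median coincide.
progression = [3, 5, 7, 9, 11, 13]
print(mean(progression) == median(progression))  # True
```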
Please also read the second part above!
END OF THE LECTURE