Assignment
Summary Statistics with Python
What is statistics?
Statistics
the practice and study of collecting and analyzing data
Summary Statistic - A fact about or summary of some data
Example
How likely is someone to purchase a product? Are people more likely to purchase it if they can use a different payment system?
How many occupants will your hotel have? How can you optimize occupancy?
How many sizes of jeans need to be manufactured so they can fit 95% of the population? Should the same number of each size be produced?
A/B tests: Which ad is more effective in getting people to purchase a product?
Types of statistics
Descriptive statistics
Describe and summarize data
Inferential statistics
Use a sample of data to make inferences about a larger population
Types of data
Numeric (Quantitative)
Continuous (Measured)
Discrete (Counted)
Categorical (Qualitative)
Nominal (Unordered)
Ordinal (Ordered)
Measures of center
Mean and median
In this module, you'll be working with the 2018 Food Carbon Footprint Index from nu3. The food_consumption dataset contains information about the kilograms of food consumed per person per year in each country in each food category (consumption) as well as information about the carbon footprint of that food category (co2_emissions) measured in kilograms of carbon dioxide, or CO2, per person per year in each country.
In this exercise, you'll compute measures of center to compare food consumption in the US and Belgium.
The output should look as follows:
   country    food category  consumption  co2_emissions
1  Argentina  pork                 10.51          37.20
2  Argentina  poultry              38.66          41.53
3  Argentina  beef                 55.48        1712.00
4  Argentina  lamb_goat             1.56          54.63
5  Argentina  fish                  4.36           6.96
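As a minimal sketch of the exercise, the mean and median consumption for one country can be computed with Python's standard statistics module. Only the five Argentina rows shown above are used here, since no other rows appear in this text. The helper name center_of_consumption and the tuple layout are assumptions made for illustration, not part of the course material.

```python
from statistics import mean, median

# The five rows shown above, as (country, food category, consumption, co2_emissions).
# The tuple layout is an assumption made for this sketch.
rows = [
    ("Argentina", "pork", 10.51, 37.20),
    ("Argentina", "poultry", 38.66, 41.53),
    ("Argentina", "beef", 55.48, 1712.00),
    ("Argentina", "lamb_goat", 1.56, 54.63),
    ("Argentina", "fish", 4.36, 6.96),
]


def center_of_consumption(rows, country):
    """Return (mean, median) of the consumption values for one country."""
    values = [consumption for c, _, consumption, _ in rows if c == country]
    return mean(values), median(values)


arg_mean, arg_median = center_of_consumption(rows, "Argentina")
print(f"Argentina: mean={arg_mean:.3f}, median={arg_median:.2f}")
```

With the full dataset loaded into rows, the same helper could be called once for the US and once for Belgium, using whatever country labels the dataset actually contains.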
Mean vs. median
You learned that the mean is the sum of all the data points divided by the total number of data points, and the median is the middle value of the dataset where 50% of the data is less than the median, and 50% of the data is greater than the median. In this exercise, you'll compare these two measures of center.
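The difference between the two measures is easiest to see on skewed data. The sketch below uses the co2_emissions values from the five Argentina rows shown earlier, where beef is far larger than the rest. It is an illustration on that small excerpt only, not a result for the full dataset.

```python
from statistics import mean, median

# co2_emissions for the five Argentina rows shown earlier.
co2 = [37.20, 41.53, 1712.00, 54.63, 6.96]

co2_mean = mean(co2)      # pulled upward by the single large beef value
co2_median = median(co2)  # the middle value, 41.53

# Drop the extreme value and recompute both measures.
rest = [value for value in co2 if value != 1712.00]
rest_mean = mean(rest)
rest_median = median(rest)

print(f"all rows:     mean={co2_mean:.2f}, median={co2_median:.2f}")
print(f"without beef: mean={rest_mean:.2f}, median={rest_median:.2f}")
```

Removing one value moves the mean from roughly 370 to roughly 35, while the median barely changes. This is why the median is often preferred for skewed data.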
Measures of spread
Variance
Average squared distance from each data point to the data's mean
Standard deviation
Mean absolute deviation
Standard deviation vs. mean absolute deviation
Standard deviation squares distances, penalizing longer distances more than shorter ones
Mean absolute deviation penalizes each distance equally
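The measures of spread listed above can be sketched with the standard statistics module. The data values are made-up toy numbers chosen so the results are easy to check by hand. The module has no mean absolute deviation function, so that one is computed directly from its definition.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Made-up toy values (hypothetical, chosen so the arithmetic is easy to verify).
data = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(data)

# Population versions divide the summed squared distances by n.
pop_var = pvariance(data)
pop_sd = pstdev(data)  # square root of the population variance

# Sample versions divide by n - 1 instead.
sample_var = variance(data)
sample_sd = stdev(data)

# Mean absolute deviation: average of the absolute distances to the mean.
mad = mean([abs(x - center) for x in data])

print(center, pop_var, pop_sd, mad)
print(sample_var, sample_sd)
```

Here the standard deviation (2.0) is larger than the mean absolute deviation (1.5) because squaring gives the distant value 9 more weight.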
Quantiles (Or percentiles)
Interquartile range (IQR)
Height of the box in a boxplot
Outliers
Data point that is substantially different from the others
Quartiles, quantiles, and quintiles
Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the data set. For example, you might want to give a discount to the 10% most active users on a website.
In this exercise, you'll calculate quintiles, which split up a dataset into 5 pieces.
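A minimal sketch of quintiles using statistics.quantiles from the standard library. Quintiles are described by six boundary values (the minimum, four cut points, and the maximum) that enclose five pieces. The activity values are made-up toy numbers, and method="inclusive" is chosen here so the minimum and maximum are treated as the 0th and 100th percentiles.

```python
from statistics import quantiles

# Made-up toy values (hypothetical): activity scores 0, 1, ..., 100.
activity = list(range(101))

# n=5 returns the four cut points that separate five equal-sized pieces.
cut_points = quantiles(activity, n=5, method="inclusive")

# Adding the minimum and maximum gives six boundaries around five pieces.
boundaries = [min(activity), *cut_points, max(activity)]

# Cutoff for the top 10%: the last of the nine decile cut points.
top_10_cutoff = quantiles(activity, n=10, method="inclusive")[-1]

print(boundaries)
print(top_10_cutoff)
```

Users whose activity exceeds top_10_cutoff would be the 10% most active, as in the discount example above.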
Variance and standard deviation
Variance and standard deviation are two of the most common ways to measure the spread of a variable, and you'll practice calculating these in this exercise. Spread is important since it can help inform expectations. For example, if a salesperson sells a mean of 20 products a day, but has a standard deviation of 10 products, there will probably be days where they sell 40 products, but also days where they only sell one or two. Information like this is important, especially when making predictions.
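The salesperson example can be turned into a rough range with a line of arithmetic. The sketch below uses the rule of thumb that most values fall within two standard deviations of the mean. That rule assumes roughly bell-shaped daily sales, which the text does not state, so treat the range as an approximation.

```python
# Figures from the salesperson example above.
mean_sales = 20
sd_sales = 10

# Rule-of-thumb range: mean plus or minus two standard deviations.
# Assumes roughly bell-shaped daily sales (an assumption of this sketch).
low = max(0, mean_sales - 2 * sd_sales)  # sales cannot be negative
high = mean_sales + 2 * sd_sales

print(f"Typical daily sales range: {low} to {high} products")
```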
Finding outliers using IQR
This is not homework but additional insight. Please carefully observe the steps.
Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that's less influenced by outliers. IQR is also often used to find outliers. If a value is less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR, it's considered an outlier. In this exercise, you'll calculate IQR and use it to find some outliers.
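The steps can be sketched as follows, using the co2_emissions values from the five Argentina rows shown earlier. The quartiles come from statistics.quantiles with method="inclusive", which is a choice made for this sketch. Other quartile conventions can give slightly different thresholds, and five data points are far too few for a meaningful outlier analysis.

```python
from statistics import quantiles

# co2_emissions for the five Argentina rows shown earlier, keyed by food category.
co2 = {
    "pork": 37.20,
    "poultry": 41.53,
    "beef": 1712.00,
    "lamb_goat": 54.63,
    "fish": 6.96,
}

# Step 1: first and third quartiles (the middle value is the median).
q1, _, q3 = quantiles(co2.values(), n=4, method="inclusive")

# Step 2: interquartile range.
iqr = q3 - q1

# Step 3: thresholds at 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Step 4: keep the values that fall outside the thresholds.
outliers = {food: value for food, value in co2.items() if value < lower or value > upper}

print(f"Q1={q1}, Q3={q3}, IQR={iqr:.2f}")
print(f"lower={lower:.3f}, upper={upper:.3f}")
print(outliers)
```

On this small excerpt both beef (far above the upper threshold) and fish (just below the lower threshold) are flagged, which shows that the rule marks unusually small values as well as unusually large ones.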