Week 20.5

This page gives a quick introduction to two fundamental data visualization libraries, matplotlib and seaborn. We split plot types into three main categories depending on the relationship they display: evolution, distribution, and correlation. Plots in the evolution category show how a variable changes over time or how its relationship with another feature evolves. As the name implies, plots in the distribution category display the distribution of a feature. Finally, plots in the correlation category display the relationship between two features in the dataset. There are many more plot types that can show relationships among features; to learn more about them, please refer to the Python Graph Gallery ( https://www.python-graph-gallery.com ), which was prepared by Yan Holtz.

Evolution

Line plots:

A line plot, also known as a line chart or line graph, is a type of data visualization that shows how one continuous variable changes with another, typically time, by connecting successive data points with line segments.

Line plots are commonly used for time series analysis, trend analysis, and displaying the relationship between continuous variables. They provide a clear and concise visualization of data trends over time or a continuous range, making them effective for understanding patterns, fluctuations, and correlations between variables.
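As a minimal sketch of a line plot in matplotlib, the example below draws a synthetic monthly series. The data (a linear trend plus a yearly cycle) and the axis labels are assumptions made up for illustration, not part of any real dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration): 24 monthly readings
# made of an upward trend plus a 12-month seasonal cycle.
months = np.arange(24)
values = 10 + 0.5 * months + 3 * np.sin(2 * np.pi * months / 12)

fig, ax = plt.subplots(figsize=(6, 3))
# ax.plot connects successive (x, y) points with line segments.
(line,) = ax.plot(months, values, marker="o", label="monthly value")
ax.set_xlabel("Month")
ax.set_ylabel("Value")
ax.set_title("Line plot of a value over time")
ax.legend()
```

Calling plt.show() or fig.savefig("line.png") afterwards displays or saves the figure.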

Area chart:

An area chart, also known as an area graph, is a type of data visualization that displays the magnitude and proportion of multiple variables over a continuous interval or time period. It is similar to a line chart but with the area beneath the lines filled in.

Area charts are useful for visualizing the cumulative effect of multiple variables, especially when comparing their proportions or magnitudes. They are commonly used for displaying time series data, market share analysis, and visualizing the composition of a whole in relation to its parts. Area charts provide a clear representation of trends, changes, and comparisons over time or a continuous range.
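A stacked area chart can be sketched with matplotlib's stackplot, as below. The product names and sales figures are hypothetical numbers chosen only to make the stacking visible.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical yearly sales of three products (made-up numbers).
years = np.array([2019, 2020, 2021, 2022, 2023])
sales = {
    "Product A": np.array([10, 12, 15, 18, 20]),
    "Product B": np.array([5, 7, 8, 12, 15]),
    "Product C": np.array([2, 3, 5, 6, 9]),
}

fig, ax = plt.subplots(figsize=(6, 3))
# stackplot fills the area under each series and stacks the series on top
# of each other, so the upper edge of the chart is the overall total.
layers = ax.stackplot(years, list(sales.values()), labels=list(sales.keys()))
ax.set_xticks(years)
ax.set_xlabel("Year")
ax.set_ylabel("Units sold")
ax.legend(loc="upper left")

# The top of the stack at each year equals the sum of the parts.
totals = np.sum(list(sales.values()), axis=0)
```

The thickness of each colored band shows one product's contribution, while the top edge shows the whole.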

Distribution

Histogram

A histogram is a graphical representation of the distribution of a dataset. It provides a visual summary of the underlying frequency or occurrence of different values or ranges of values within a dataset. The values are split into bins, and each bin is represented as a bar. You can find the key characteristics and components of a histogram below.

  • X-axis and Y-axis: The x-axis represents the range of values or intervals being measured or observed. It is divided into bins, usually of equal width, that cover the entire range of the dataset. The y-axis represents the frequency or count of occurrences for each bin.

  • Bins or intervals: Bins are the subintervals into which the range of values on the x-axis is divided. Each bin represents a specific range or value, and its width may vary depending on the data and the desired level of detail. The height of each bar in the histogram corresponds to the frequency or count of observations falling within that bin.

  • Frequency or count: The height of each bar in the histogram represents the frequency or count of occurrences within the corresponding bin. It indicates how many data points fall within that specific range or value.

  • Distribution: The shape of the histogram provides insights into the distribution of the data. Common distribution shapes include bell-shaped (normal distribution), skewed to the left or right (negatively or positively skewed, respectively), or multimodal (having multiple peaks).

  • Bar widths: The width of each bar in a histogram is proportional to the width of the corresponding bin. The bars are typically drawn adjacent to each other, with no space between them, to indicate the continuity of the data.

Histograms are commonly used in data analysis and exploratory data visualization to understand the distribution and characteristics of a dataset. They provide a visual representation of how the data is spread across different values or intervals. Histograms are particularly useful for identifying patterns, central tendencies, outliers, and assessing the shape of the data distribution.
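The components listed above can be sketched with matplotlib's hist, which returns the per-bin counts and the bin edges alongside the drawn bars. The normally distributed sample and the choice of 20 bins are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample (an assumption): 1000 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=1000)

fig, ax = plt.subplots()
# bins=20 splits the observed range into 20 equal-width intervals.
# counts[i] is the number of observations between bin_edges[i] and
# bin_edges[i + 1]; bars holds the drawn rectangles.
counts, bin_edges, bars = ax.hist(data, bins=20, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.set_title("Histogram with 20 equal-width bins")
```

Changing the bins argument is usually the first thing to try: too few bins hide the shape of the distribution, while too many make it noisy.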

Density plot

A density plot, also known as a kernel density plot, is a data visualization technique that represents the underlying probability density function of a continuous variable. It provides a smoothed estimate of the distribution of the data.

  • X-axis: The x-axis represents the range of values observed in the dataset for the variable being analyzed. It is labeled and scaled according to the minimum and maximum values of the data.

  • Y-axis: The y-axis represents the estimated probability density. The density at a point is not itself a probability; the probability of a value falling within a specific range is the area under the curve over that range.

  • Smoothed density curve: The density plot consists of a smoothed curve that represents the estimated probability density function. The shape of the curve provides insights into the distribution of the data. It is created using kernel density estimation, which involves placing a kernel (typically a Gaussian distribution) on each data point and summing them, scaled so that the total area is 1, to obtain a smooth curve.

  • Peaks and troughs: Peaks in the density plot represent areas of higher probability density, indicating regions where the data is more likely to be concentrated. Troughs, on the other hand, indicate areas of lower density.

  • Area under the curve: The area under the density curve over an interval represents the probability of the variable falling within that interval. The total area under the curve is equal to 1.

Density plots are commonly used to visualize the shape, spread, and central tendencies of continuous variables. They provide a smooth and continuous representation of the data distribution, making them particularly useful for exploring the overall pattern or density of the data. Density plots allow for easy comparison between multiple distributions and can reveal underlying patterns or deviations in the data.
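To make the kernel density idea concrete, here is a minimal sketch that builds the curve by hand with NumPy: one Gaussian bump per data point, averaged so the total area is 1. The helper gaussian_kde_curve, the bimodal sample, and the bandwidth of 0.4 are all assumptions for illustration; in practice a library routine such as seaborn's kdeplot would typically choose the bandwidth for you.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic bimodal sample (an assumption for illustration).
rng = np.random.default_rng(seed=1)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])

def gaussian_kde_curve(sample, grid, bandwidth):
    """Hypothetical helper: average one Gaussian kernel per data point."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Evaluate the estimate on a grid that extends past the observed range.
grid = np.linspace(data.min() - 3, data.max() + 3, 500)
density = gaussian_kde_curve(data, grid, bandwidth=0.4)

# Trapezoidal rule: the area under the estimated density should be about 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

fig, ax = plt.subplots()
ax.plot(grid, density)
ax.fill_between(grid, density, alpha=0.3)
ax.set_xlabel("Value")
ax.set_ylabel("Estimated density")
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths the two peaks together.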

Boxplot

A boxplot is a graphical representation of the distribution of a dataset. It provides a summary of the data's central tendency, dispersion, and skewness. Here's a description of the components of a boxplot:

  • Median: The line within the box represents the median or the 50th percentile of the data. It indicates the central value of the dataset.

  • Box: The box, typically extending from the lower quartile (25th percentile) to the upper quartile (75th percentile), encloses the interquartile range (IQR). The IQR represents the middle 50% of the data and provides information about the spread or dispersion.

  • Whiskers: The whiskers extend from the box toward the extremes of the data. By a common convention, each whisker ends at the most extreme data point that lies within 1.5 times the IQR of the nearest quartile. Data points beyond the whiskers are considered outliers and are usually plotted individually.

  • Outliers: Individual data points that fall outside the whiskers are considered outliers and are shown as individual points or small circles. Outliers can provide insights into extreme or unusual observations within the dataset.

  • Notches: Some boxplots include notches in the boxes. These notches represent the confidence interval around the median and can provide a rough comparison of the medians' statistical significance between groups.

Boxplots are useful for visualizing the distribution and key statistical measures of a dataset. They enable comparisons between different groups or categories and help identify outliers or skewness. By summarizing the data in a compact manner, boxplots provide a clear overview of the dataset's characteristics.
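The boxplot components described above can be computed by hand and compared with what matplotlib draws. This is a sketch on a small made-up sample; it assumes the 1.5 x IQR whisker convention (matplotlib's default, set here explicitly through whis=1.5) and NumPy's default linear-interpolation percentiles.

```python
import numpy as np
import matplotlib.pyplot as plt

# Small made-up sample with one extreme value (40).
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 40])

# Quartiles and interquartile range.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles; points outside are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the most extreme observations inside the fences.
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)
ax.set_ylabel("Value")
```

In this sample the value 40 lies above the upper fence, so it is drawn as an individual point while the upper whisker stops at 22.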

Violin plot

A violin plot visualizes the distribution of a numeric variable for one or several groups. It is closely related to a boxplot but allows a deeper understanding of the distribution, because it combines aspects of a boxplot and a kernel density plot. Here's a description of the components of a violin plot:

  • Violin shape: The main feature of a violin plot is its violin-shaped curve for each category or group. The width of the curve at any given point represents the estimated density or distribution of the data. Wider sections indicate areas of higher density, while narrower sections indicate lower density.

  • Median and interquartile range (IQR): Inside each violin shape, a white dot or a thick line often marks the median. The length of the box-like structure within the violin corresponds to the IQR, which covers the middle 50% of the data.

  • Kernel density estimation: The outline of each violin is itself a smoothed kernel density estimate, usually mirrored on both sides of a central axis. It provides an estimate of the underlying probability density function of the data within each group.

Violin plots offer a visual representation of the distribution and summary statistics of the data, combining aspects of both box plots and kernel density plots. They provide insights into the shape, variability, and central tendencies of the data within different groups, allowing for easy comparisons. Additionally, violin plots can reveal asymmetries or multiple peaks in the data distribution, making them particularly useful for exploratory data analysis and group comparisons.
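The sketch below uses matplotlib's violinplot to compare two synthetic groups, one with a single peak and one with two. The group names and data are assumptions for illustration; seaborn's violinplot is a higher-level alternative that also draws the inner box.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two synthetic groups (an assumption): one unimodal, one bimodal.
rng = np.random.default_rng(seed=2)
groups = {
    "unimodal": rng.normal(0, 1, 300),
    "bimodal": np.concatenate(
        [rng.normal(-2, 0.5, 150), rng.normal(2, 0.5, 150)]
    ),
}

fig, ax = plt.subplots()
# Each violin is a mirrored kernel density estimate of one group.
# showmedians adds a line at each group's median.
parts = ax.violinplot(list(groups.values()), showmedians=True)
ax.set_xticks([1, 2])  # violins are placed at positions 1, 2, ... by default
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Value")
```

A boxplot of the bimodal group would look unremarkable, whereas its violin shows two clear bulges, which is the main reason to prefer violins when the shape matters.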

Correlation

Scatter Plot

A scatter plot is a type of data visualization that displays the relationship between two numerical variables. It represents data points as individual dots or markers on a Cartesian coordinate system. Please find the key components of a scatter plot below.

  • Data points: Each data point is represented as a dot or marker on the plot, with its position determined by the values of the variables it represents. The x and y coordinates of a data point correspond to the values of the two variables being examined.

  • Patterns and relationships: Scatter plots help identify patterns or relationships between the two variables. These patterns can include positive correlation (both variables increase or decrease together), negative correlation (one variable increases while the other decreases), or no correlation (no clear relationship between the variables).

  • Clusters and outliers: Scatter plots may reveal clusters of data points, indicating subgroups or patterns within the data. Outliers, which are data points that significantly deviate from the overall trend, can also be identified.

Scatter plots are widely used in exploratory data analysis to understand the relationship between two variables, identify patterns, detect outliers, and explore potential associations. They provide a visual representation that allows for quick interpretation and insights into the data.
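As a minimal sketch, the example below draws a scatter plot of two positively related variables and computes their Pearson correlation coefficient with NumPy. The data are synthetic: y is assumed to be a noisy linear function of x.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption): y grows linearly with x, plus noise.
rng = np.random.default_rng(seed=3)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 2.0, 200)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
# Each observation becomes one marker at its (x, y) position.
points = ax.scatter(x, y, alpha=0.6)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Scatter plot (Pearson r = {r:.2f})")
```

Replacing 2.0 with a negative slope produces a downward-sloping cloud and a negative r, while removing the x term leaves a shapeless cloud with r near zero.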

Heatmap

A heatmap is a graphical representation of data in which the individual values contained in a matrix are represented as colors. It is a bit like looking at a data table from above. It is very useful for giving a general view of numerical data, but not for reading off specific data points.

It is widely used to plot the correlation matrix of a data frame. You can find the explanation of the key characteristics and components of a heatmap below:

  • Color encoding: Each cell in the heatmap represents a value from the underlying matrix. The magnitude of the value is represented by a color, typically using a color gradient or colormap. Higher values are often represented by warmer or brighter colors, while lower values are represented by cooler or darker colors.

  • Matrix representation: The heatmap is organized as a grid, with rows and columns representing the two dimensions of the data being analyzed. The values in the matrix can be numerical, categorical, or even binary, depending on the context and purpose of the visualization.

  • Color scale: The color scale or colormap used in the heatmap maps the range of values in the matrix to a specific color gradient. The color scale typically includes a legend or color bar that provides a reference for interpreting the colors and their corresponding values.

  • Data intensity: The intensity or shading of the color in each cell indicates the relative magnitude of the corresponding value. Which end of the scale looks darker or lighter depends on the chosen colormap, so cells should be read against the color bar.

Heatmaps are particularly effective when visualizing large datasets and identifying relationships or clusters within the data.
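The following sketch plots a correlation matrix as a heatmap using NumPy and matplotlib's imshow. The four variables a, b, c, and d are synthetic and constructed so that b follows a, c opposes a, and d is unrelated; the "coolwarm" colormap is one possible choice for a scale centered on zero. With a pandas data frame, seaborn's heatmap applied to the frame's correlation matrix is a common shortcut.

```python
import numpy as np
import matplotlib.pyplot as plt

# Four synthetic variables (an assumption for illustration).
rng = np.random.default_rng(seed=4)
n = 300
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)   # positively related to a
c = -a + 0.5 * rng.normal(size=n)  # negatively related to a
d = rng.normal(size=n)             # unrelated to the others
names = ["a", "b", "c", "d"]

# Each row is treated as one variable, giving a 4 x 4 correlation matrix.
corr = np.corrcoef([a, b, c, d])

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so that zero correlation sits mid-scale.
image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(image, ax=ax, label="Pearson correlation")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)

# Write the value inside each cell for readers who need exact numbers.
for i in range(len(names)):
    for j in range(len(names)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")
```

The color bar is what lets a reader translate a cell's color back into a value, so it is worth keeping even when the cells are annotated.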

Barplot

A barplot, also known as a bar chart or bar graph, shows the relationship between a numeric and a categorical variable. It displays categorical data using rectangular bars whose size represents the frequency, count, or proportion of each category or group.

  • Vertical or horizontal bars: The bars in a bar plot can be oriented vertically or horizontally. Vertical bar plots are more common, with the height of each bar representing the magnitude or frequency of the category it represents. Horizontal bar plots have bars extending horizontally, with the length of each bar indicating the magnitude or frequency.

  • X-axis: In a vertical bar plot, the x-axis represents the categories or groups being analyzed. Each category or group is labeled and evenly spaced along the axis.

  • Y-axis: In a vertical bar plot, the y-axis represents the frequency, count, or proportion associated with each category or group. It is labeled and scaled according to the range of values observed for the data. In a horizontal bar plot the two axes swap roles.

  • Bar height or length: The height or length of each bar represents the frequency, count, or proportion associated with the corresponding category or group. The height or length of the bars can be uniform or vary depending on the data values.

  • Spacing and width: The bars in a bar plot are usually evenly spaced and have uniform width. The spacing between the bars helps distinguish individual categories or groups and creates visual separation.

  • Color or pattern: Different categories or groups can be represented using different colors or patterns for better differentiation and visual appeal. This can help convey additional information or highlight specific categories.

Bar plots are commonly used for comparing categorical data, displaying frequencies or counts, and visualizing relationships between different groups or categories. They provide a clear representation of the distribution and comparisons between the categories or groups. Bar plots are effective for conveying both qualitative and quantitative information, making them widely used in data analysis, research, and reporting.
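A minimal bar plot sketch: count how often each category occurs and draw one bar per category. The fruit observations and colors are made up for illustration.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Made-up categorical observations.
observations = ["apple", "banana", "cherry", "banana", "apple", "banana"]

# Count occurrences per category and fix a display order.
counts = Counter(observations)
labels = sorted(counts)
heights = [counts[label] for label in labels]

fig, ax = plt.subplots()
# One bar per category; the bar height is that category's count.
bars = ax.bar(labels, heights, color=["tab:red", "gold", "darkred"])
ax.set_xlabel("Fruit")
ax.set_ylabel("Count")
ax.set_title("Bar plot of category counts")
```

Using ax.barh(labels, heights) instead draws the same data as horizontal bars, which tends to read better when category names are long.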

Plot types based on data type

It is sometimes quite confusing to choose the right plot for your data. Here, you can find a nice guide. You can look into possible plot types depending on the data type or the number of variables you want to visualize.
