Week 21
DATA VISUALIZATION
Last updated
plotly.py is an interactive, open-source, and browser-based graphing library for Python.
Plotly's Python graphing library makes interactive, publication-quality graphs.
Plotly graphs can be viewed in Jupyter notebooks, standalone HTML files, or hosted online using Chart Studio Cloud.
pip install plotly
conda install -c plotly plotly
The plotly Python package exists to create, manipulate and render graphical figures (i.e. charts, plots, maps and diagrams) represented by data structures also referred to as figures.
The rendering process uses the Plotly.js JavaScript library under the hood, although Python developers using this module very rarely need to interact with the JavaScript library directly, if ever.
Figures can be represented in Python either as dicts or as instances of the classes in plotly.graph_objects.
At a low level, figures can be represented as dictionaries and displayed using functions from the plotly.io module.
The fig dictionary in the example below describes a figure. It contains a single bar trace and a title.
The plotly.graph_objects module provides an automatically-generated hierarchy of classes called "graph objects" that may be used to represent figures, with a top-level class plotly.graph_objects.Figure.
You can also create a graph object figure from a dictionary representation by passing the dictionary to the go.Figure constructor.
Plotly Express (included as the plotly.express module) is a high-level data visualization API that produces fully-populated graph object figures in single function-calls.
Every Plotly Express function uses graph objects internally and returns a plotly.graph_objects.Figure instance.
As demonstrated above, you can build a complete figure by passing trace and layout specifications to the plotly.graph_objects.Figure constructor. These trace and layout specifications can be either dictionaries or graph objects.
In the following example, the traces are specified using graph objects and the layout is specified as a dictionary.
The plotly.subplots.make_subplots() function produces a graph object figure that is preconfigured with a grid of subplots that traces can be added to.
Regardless of how a graph object figure was constructed, it can be updated by adding additional traces to it and modifying its properties.
New traces can be added to a graph object figure using the add_trace() method. This method accepts a graph object trace (an instance of go.Scatter, go.Bar, etc.) and adds it to the figure. This allows you to start with an empty figure, and add traces to it sequentially. The append_trace() method does the same thing, although it does not return the figure.
You can also add traces to a figure produced by a figure factory or Plotly Express.
This also works for figures created by Plotly Express using the facet_row and/or facet_col arguments.
To make it easier to work with nested properties, graph object constructors and many graph object methods support magic underscore notation.
This allows you to reference nested properties by joining together multiple nested property names with underscores.
For example, specifying the figure title in the figure constructor without magic underscore notation requires setting the layout argument to dict(title=dict(text="A Chart")).
Similarly, setting the line color of a scatter trace requires setting the line property to dict(color="crimson").
With magic underscore notation, you can accomplish the same thing by passing the figure constructor a keyword argument named layout_title_text, and by passing the go.Scatter constructor a keyword argument named line_color.
Graph object figures support an update_layout() method that may be used to update multiple nested properties of a figure's layout.
Here is an example of updating the text and font size of a figure's title using update_layout().
Note that the following update_layout() operations are equivalent:
Graph object figures support an update_traces() method that may be used to update multiple nested properties of one or more of a figure's traces.
To show some examples, we will start with a figure that contains bar and scatter traces across two subplots.
Note that both scatter and bar traces have a marker.color property to control their coloring. Here is an example of using update_traces() to modify the color of all traces.
The update_traces() method supports a selector argument to control which traces should be updated. Only traces with properties that match the selector will be updated. Here is an example of using a selector to only update the color of the bar traces.
Magic underscore notation can be used in the selector to match nested properties. Here is an example of updating the color of all traces that were formerly colored "MediumPurple".
For figures with subplots, the update_traces() method also supports row and col arguments to control which traces should be updated. Only traces in the specified subplot row and column will be updated. Here is an example of updating the color of all traces in the second subplot column.
The update_traces() method can also be used on figures produced by figure factories or Plotly Express. Here's an example of updating the regression lines produced by Plotly Express to be dotted.
Graph object figures support update_xaxes() and update_yaxes() methods that may be used to update multiple nested properties of one or more of a figure's axes. Here is an example of using update_xaxes() to disable the vertical grid lines across all subplots in a figure produced by Plotly Express.
All of the figure update operations described above are methods that return a reference to the figure being modified. This makes it possible to chain multiple figure modification operations together into a single expression.
Here is an example of a chained expression that:
creates a faceted scatter plot with OLS trend lines using Plotly Express,
sets the title font size using update_layout(),
disables vertical grid lines using update_xaxes(),
updates the width and dash pattern of the trend lines using update_traces(),
and then displays the figure using show().
Trace and layout properties can be updated using property assignment syntax. Here is an example of setting the figure title using property assignment.
And here is an example of updating the bar outline using property assignment.
With px.scatter, each data point is represented as a marker point, whose location is given by the x and y columns.
Note that color and size data are added to hover information. You can add other columns to hover data with the hover_data argument of px.scatter.
A bubble chart is a scatter plot in which a third dimension of the data is shown through the size of markers.
With px.line, each data point is represented as a vertex (whose location is given by the x and y columns) of a polyline mark in 2D space.
With px.bar, each row of the DataFrame is represented as a rectangular mark.
When several rows share the same value of x (here Female or Male), the rectangles are stacked on top of one another by default.
Use the keyword arguments facet_row (resp. facet_col) to create faceted subplots, where different rows (resp. columns) correspond to different values of the dataframe column specified in facet_row (resp. facet_col).
A pie chart is a circular statistical chart, which is divided into sectors to illustrate numerical proportion.
In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is represented.
More generally, in plotly a histogram is an aggregated bar chart, with several possible aggregation functions (e.g. sum, average, count...). Also, the data to be binned can be numerical data but also categorical or date data.
Plotly.js calculates the y-axis (count) values on the fly in the browser, so they are not accessible in the fig object. You can calculate them manually using np.histogram.
The default mode is to represent the count of samples in each bin.
With the histnorm argument, it is also possible to represent the percentage or fraction of samples in each bin (histnorm='percent' or 'probability'), a density histogram (histnorm='density', where the sum of all bar areas equals the total number of sample points), or a probability density histogram (histnorm='probability density', where the sum of all bar areas equals 1).
For each bin of x, one can compute a function of data using histfunc. The argument of histfunc is the dataframe column given as the y argument. Below the plot shows that the average tip increases with the total bill.
With the marginal keyword, a subplot is drawn alongside the histogram, visualizing the distribution.
A box plot is a statistical representation of numerical data through their quartiles. The ends of the box represent the lower and upper quartiles, while the median (second quartile) is marked by a line inside the box.
In a box plot created by px.box, the distribution of the column given as y argument is represented.
If a column name is given as x argument, a box plot is drawn for each value of x.
By default, quartiles for box plots are computed using the linear method (for more about linear interpolation, see #10 listed on http://www.amstat.org/publications/jse/v14n3/langford.html and https://en.wikipedia.org/wiki/Quartile for more details).
However, you can also choose to use an exclusive or an inclusive algorithm to compute quartiles.
The exclusive algorithm uses the median to divide the ordered dataset into two halves. If the sample is odd, it does not include the median in either half. Q1 is then the median of the lower half and Q3 is the median of the upper half.
The inclusive algorithm also uses the median to divide the ordered dataset into two halves, but if the sample is odd, it includes the median in both halves. Q1 is then the median of the lower half and Q3 the median of the upper half.
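The two algorithms can be sketched in plain Python directly from the descriptions above. The `quartiles` helper is hypothetical and written for illustration; it is not Plotly's implementation, which may differ in detail. In a figure, the method is chosen through the box trace's `quartilemethod` property, for example `fig.update_traces(quartilemethod="exclusive")`.

```python
from statistics import median


def quartiles(values, method="exclusive"):
    """Return (Q1, Q3) using the exclusive or inclusive halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even sample: the median falls between the two halves.
        lower, upper = data[:half], data[half:]
    elif method == "exclusive":
        # Odd sample: the median is left out of both halves.
        lower, upper = data[:half], data[half + 1:]
    elif method == "inclusive":
        # Odd sample: the median is included in both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        raise ValueError(f"unknown method: {method!r}")
    return median(lower), median(upper)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
exclusive_q = quartiles(sample, "exclusive")  # halves [1..4] and [6..9]
inclusive_q = quartiles(sample, "inclusive")  # halves [1..5] and [5..9]
```

For this nine-value sample the exclusive method gives Q1 = 2.5 and Q3 = 7.5, while the inclusive method gives Q1 = 3 and Q3 = 7.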
A violin plot is a statistical representation of numerical data.
It is similar to a box plot, with the addition of a rotated kernel density plot on each side.
A scatterplot matrix is a matrix associated to n numerical arrays (data variables), X1,X2,…,Xn , of the same length. The cell (i,j) of such a matrix displays the scatter plot of the variable Xi versus Xj.
Here we show the Plotly Express function px.scatter_matrix to plot the scatter matrix for the columns of the dataframe. By default, all columns are considered.
Specify the columns to be represented with the dimensions argument, and set colors using a column of the dataframe:
The scatter matrix plot can be configured thanks to the parameters of px.scatter_matrix, but also thanks to fig.update_traces for fine tuning (see the next section to learn more about the options).
Like the 2D scatter plot px.scatter, the 3D function px.scatter_3d plots individual data in three-dimensional space.
A 4th dimension of the data can be represented thanks to the color of the markers. Also, values from the species column are used below to assign symbols to markers.
It is possible to customize the style of the figure through the parameters of px.scatter_3d for some options, or by updating the traces or the layout of the figure through fig.update.
An inset plot is a layer which is added to an existing layer in a graph window.
A Radar Chart (also known as a spider plot or star plot) displays multivariate data in the form of a two-dimensional chart of quantitative variables represented on axes originating from the center. The relative position and angle of the axes is typically uninformative. It is equivalent to a parallel coordinates plot with the axes arranged radially.
Plotly supports two different kinds of maps:
Mapbox maps are tile-based maps. If your figure is created with a px.scatter_mapbox, px.line_mapbox, px.choropleth_mapbox or px.density_mapbox function or otherwise contains one or more traces of type go.Scattermapbox, go.Choroplethmapbox or go.Densitymapbox, the layout.mapbox object in your figure contains configuration information for the map itself.
Geo maps are outline-based maps. If your figure is created with a px.scatter_geo, px.line_geo or px.choropleth function or otherwise contains one or more traces of type go.Scattergeo or go.Choropleth, the layout.geo object in your figure contains configuration information for the map itself.
Mapbox tile maps are composed of various layers, of three different types:
layout.mapbox.style defines the lowest layers, also known as your "base map"
The various traces in data are by default rendered above the base map (although this can be controlled via the below attribute).
layout.mapbox.layers is an array that defines more layers that are by default rendered above the traces in data (although this can also be controlled via the below attribute).
The word "mapbox" in the trace names and layout.mapbox refers to the Mapbox GL JS open-source library, which is integrated into Plotly.py.
If your basemap in layout.mapbox.style uses data from the Mapbox service, then you will need to register for a free account at https://mapbox.com/ and obtain a Mapbox Access token. This token should be provided in layout.mapbox.access_token (or, if using Plotly Express, via the px.set_mapbox_access_token() configuration function).
If your layout.mapbox.style does not use data from the Mapbox service, you do not need to register for a Mapbox account.
Base Maps in layout.mapbox.style
The accepted values for layout.mapbox.style are one of:
"white-bg" yields an empty white canvas which results in no external HTTP requests
"open-street-map", "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner" or "stamen-watercolor" yield maps composed of raster tiles from various public tile servers which do not require signups or access tokens
"basic", "streets", "outdoors", "light", "dark", "satellite", or "satellite-streets" yield maps composed of vector tiles from the Mapbox service, and do require a Mapbox Access Token or an on-premise Mapbox installation.
A Mapbox service style URL, which requires a Mapbox Access Token or an on-premise Mapbox installation.
A Mapbox Style object as defined at https://docs.mapbox.com/mapbox-gl-js/style-spec/
Here is a simple map rendered with OpenStreetMap tiles, without needing a Mapbox Access Token:
px.scatter_mapbox can work well with GeoPandas dataframes whose geometry is of type Point.
A Choropleth Map is a map composed of colored polygons. It is used to represent spatial variations of a quantity.
main parameters for choropleth tile maps
Making choropleth Mapbox maps requires two main types of input:
GeoJSON-formatted geometry information where each feature has either an id field or some identifying value in properties.
A list of values indexed by feature identifier.
The GeoJSON data is passed to the geojson argument, and the data is passed into the color argument of px.choropleth_mapbox (z if using graph_objects), in the same order as the IDs are passed into the locations argument.
Note the geojson attribute can also be the URL to a GeoJSON file, which can speed up map rendering in certain cases.
GeoJSON with feature.id
Here we load a GeoJSON file containing the geometry information for US counties, where feature.id is a FIPS code.
Data indexed by id
Here we load unemployment data by county, also indexed by FIPS code.
With px.choropleth_mapbox, each row of the DataFrame is represented as a region of the choropleth.
If the GeoJSON you are using either does not have an id field or you wish to use one of the keys in the properties field, you may use the featureidkey parameter to specify where to match the values of locations.
In the following GeoJSON object/data-file pairing, the values of properties.district match the values of the district column:
To use them together, we set locations to district and featureidkey to "properties.district". The color is set to the number of votes by the candidate named Bergeron.
In addition to continuous colors, we can discretely-color our choropleth maps by setting color to a non-numerical column, like the name of the winner of an election.
With px.scatter_geo, each line of the dataframe is represented as a marker point. The column set as the size argument gives the size of markers.
Several Plotly Express functions support the creation of animated figures through the animation_frame and animation_group arguments.
Here is an example of an animated scatter plot created using Plotly Express. Note that you should always fix the axis ranges (the range_x and range_y arguments) to ensure that your data remains visible throughout the animation.
missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset.
pip install missingno
The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.
This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.
msno.bar is a simple visualization of nullity by column.
This bar chart gives you an idea about how many missing values are there in each column.
You can switch to a logarithmic scale by specifying log=True. bar provides the same information as matrix, but in a simpler format.
The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
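The quantities behind these displays can be approximated with pandas alone. The sketch below uses a made-up DataFrame and is not missingno's own code: the non-null counts correspond to what `msno.bar(df)` draws, and the correlation of the null indicators approximates what `msno.heatmap(df)` shows.

```python
import numpy as np
import pandas as pd

# Illustrative data with deliberately related missingness patterns.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
        "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],         # missing where "a" is
        "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],  # present where "a" is not
        "d": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    }
)

# Per-column completeness: the bar heights in a nullity bar chart.
non_null_counts = df.notnull().sum()

# Nullity correlation: correlate the 0/1 "is missing" indicators.
nullity = df.isnull().astype(int)
nullity_corr = nullity.corr()
```

A value of 1 means two columns are always missing together, and -1 means one is present exactly when the other is missing.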
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
To interpret this graph, read it from a top-down perspective. Cluster leaves which are linked together at a distance of zero fully predict one another's presence: one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.
Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually do, or ought to, match each other in nullity, then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.
As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However, the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.