Week 21

DATA VISUALIZATION

PLOTLY

Definition

  • plotly.py is an interactive, open-source, and browser-based graphing library for Python.

  • Plotly's Python graphing library makes interactive, publication-quality graphs.

  • Plotly graphs can be viewed in Jupyter notebooks, standalone HTML files, or hosted online using Chart Studio Cloud.

Resources

Installation

  • pip install plotly

  • conda install -c plotly plotly

The Figure Data Structure in Python

  • The plotly Python package exists to create, manipulate and render graphical figures (i.e. charts, plots, maps and diagrams) represented by data structures also referred to as figures.

  • The rendering process uses the Plotly.js JavaScript library under the hood, although Python developers using this module very rarely need to interact with the JavaScript library directly, if ever.

  • Figures can be represented in Python either as dicts or as instances of the plotly.graph_objects.Figure class.

Basic Structure

import plotly.express as px #import the plotly express

fig = px.line(x=["a","b","c"], y=[1,3,2], title="sample figure") #basic plotly express code to create a line graph

print(fig) # to view the underlying data structure

fig.show() #to show the fig graph

Creating and Updating Figures in Python

Figures As Dictionaries

  • At a low level, figures can be represented as dictionaries and displayed using functions from the plotly.io module.

  • The fig dictionary in the example below describes a figure. It contains a single bar trace and a title.

#Figures As Dictionaries
fig = dict({
    "data": [{"type": "bar",
              "x": [1, 2, 3],
              "y": [1, 3, 2]}],
    "layout": {"title": {"text": "A Figure Specified By Python Dictionary"}}
})

# To display the figure defined by this dict, use the low-level plotly.io.show function
import plotly.io as pio

pio.show(fig)

Figures as Graph Objects

  • The plotly.graph_objects module provides an automatically-generated hierarchy of classes called "graph objects" that may be used to represent figures, with a top-level class plotly.graph_objects.Figure.

#Figures as Graph Objects
import plotly.graph_objects as go

#create a fig object with go.Figure()
#define graph type, x-axis, y-axis and layout format
fig = go.Figure(
    data=[go.Bar(x=[1, 2, 3], y=[1, 3, 2])],
    layout=go.Layout(
        title=go.layout.Title(text="A Figure Specified By A Graph Object")
    )
)

fig.show()

  • You can also create a graph object figure from a dictionary representation by passing the dictionary to the go.Figure constructor.

import plotly.graph_objects as go

dict_of_fig = dict({
    "data": [{"type": "bar",
              "x": [1, 2, 3],
              "y": [1, 3, 2]}],
    "layout": {"title": {"text": "A Figure Specified By A Graph Object With A Dictionary"}}
})

fig = go.Figure(dict_of_fig)

fig.show()

Creating Figures

Plotly Express

  • Plotly Express (included as the plotly.express module) is a high-level data visualization API that produces fully-populated graph object figures in a single function call.

  • Every Plotly Express function uses graph objects internally and returns a plotly.graph_objects.Figure instance.

import plotly.express as px

df = px.data.iris() #data

#create a fig object, define the data, x-axis, y-axis, 3rd dimension as color, title.
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", title="A Plotly Express Figure")

# If you print the figure, you'll see that it's just a regular figure with data and layout
print(fig)

fig.show()

Graph Objects Figure Constructor

  • As demonstrated above, you can build a complete figure by passing trace and layout specifications to the plotly.graph_objects.Figure constructor. These trace and layout specifications can be either dictionaries or graph objects.

  • In the following example, the traces are specified using graph objects and the layout is specified as a dictionary.

import plotly.graph_objects as go
import plotly.express as px

df = px.data.iris() #data

fig = go.Figure() #create a figure object

# fig.add_trace() adds a trace (here a scatter trace) to the figure.
fig.add_trace(
    go.Scatter(
        x=df.sepal_width, 
        y=df.sepal_length,
        mode="markers"
    )
)

#fig.update_layout() defines the layout and style of the graph.
fig.update_layout(dict(title='A Figure Specified By A Graph Object',
                       xaxis= dict(title= 'sepal_width'),
                       yaxis= dict(title= 'sepal_length')
                 ))

fig.show()

Subplots

  • The plotly.subplots.make_subplots() function produces a graph object figure that is preconfigured with a grid of subplots that traces can be added to.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2) # a grid of subplots with 1 row and 2 columns

fig.add_trace(go.Scatter(y=[4, 2, 1], mode="lines"), row=1, col=1)
fig.add_trace(go.Bar(y=[2, 1, 3]), row=1, col=2)

fig.show()

Updating Figures

Adding traces

  • Regardless of how a graph object figure was constructed, it can be updated by adding additional traces to it and modifying its properties.

  • New traces can be added to a graph object figure using the add_trace() method. This method accepts a graph object trace (an instance of go.Scatter, go.Bar, etc.) and adds it to the figure. This allows you to start with an empty figure, and add traces to it sequentially. The append_trace() method does the same thing, although it does not return the figure.

import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(x=[1, 2, 3], y=[1, 3, 2]))

fig.show()

  • You can also add traces to a figure produced by a figure factory or Plotly Express.

import plotly.express as px
import plotly.graph_objects as go

df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 title="Using The add_trace() method With A Plotly Express Figure")

fig.add_trace(
    go.Scatter(
        x=[2, 4],
        y=[4, 8],
        mode="lines",
        line=go.scatter.Line(color="gray"),
        showlegend=False)
)

fig.show()

Adding Traces To Subplots

  • The add_trace() method accepts row and col arguments that place a trace in a particular subplot. This also works for figures created by Plotly Express using the facet_row and/or facet_col arguments.

import plotly.express as px
import plotly.graph_objects as go

df = px.data.iris()

# Define a dimension with a categorical variable and create a subplot for each category.
# facet_col creates subplots as columns.
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", facet_col="species",
                 title="Adding Traces To Subplots Within A Plotly Express Figure")

reference_line = go.Scatter(x=[2, 4],
                            y=[4, 8],
                            mode="lines",
                            line=go.scatter.Line(color="gray"),
                            showlegend=False)

fig.add_trace(reference_line, row=1, col=1)
fig.add_trace(reference_line, row=1, col=2)
fig.add_trace(reference_line, row=1, col=3)

fig.show()

import plotly.express as px
import plotly.graph_objects as go

df = px.data.iris()

# facet_row creates subplots as rows.
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", facet_row="species",
                 title="Adding Traces To Subplots Within A Plotly Express Figure")

reference_line = go.Scatter(x=[2, 4],
                            y=[4, 8],
                            mode="lines",
                            line=go.scatter.Line(color="gray"),
                            showlegend=False)

fig.add_trace(reference_line, row=1, col=1)
fig.add_trace(reference_line, row=2, col=1)
fig.add_trace(reference_line, row=3, col=1)

fig.show()

Magic Underscore Notation

  • To make it easier to work with nested properties, graph object constructors and many graph object methods support magic underscore notation.

  • This allows you to reference nested properties by joining together multiple nested property names with underscores.

  • For example, specifying the figure title in the figure constructor without magic underscore notation requires setting the layout argument to dict(title=dict(text="A Chart")).

  • Similarly, setting the line color of a scatter trace requires setting the line property to dict(color="crimson").

import plotly.graph_objects as go

fig = go.Figure(
    data=[go.Scatter(y=[1, 3, 2], line=dict(color="crimson"))],
    layout=dict(title=dict(text="A Graph Object Figure Without Magic Underscore Notation"))
)

fig.show()

  • With magic underscore notation, you can accomplish the same thing by passing the figure constructor a keyword argument named layout_title_text, and by passing the go.Scatter constructor a keyword argument named line_color.

import plotly.graph_objects as go

fig = go.Figure(
    data=[go.Scatter(y=[1, 3, 2], line_color="crimson")],
    layout_title_text="Another Graph Object Figure With Magic Underscore Notation"
)

fig.show()

Updating Figure Layouts

  • Graph object figures support an update_layout() method that may be used to update multiple nested properties of a figure's layout.

  • Here is an example of updating the text and font size of a figure's title using update_layout().

import plotly.graph_objects as go

fig = go.Figure(data=go.Bar(x=[1, 2, 3], y=[1, 3, 2]))

fig.update_layout(title_text="Using update_layout() With Graph Object Figures",
                  title_font_size=30)

fig.show()

  • Note that the following update_layout() operations are equivalent:

fig.update_layout(title_text="update_layout() Syntax Example",
                  title_font_size=30)

fig.update_layout(title_text="update_layout() Syntax Example",
                  title_font=dict(size=30))


fig.update_layout(title=dict(text="update_layout() Syntax Example",
                             font=dict(size=30)))

fig.update_layout({"title": {"text": "update_layout() Syntax Example",
                             "font": {"size": 30}}})

fig.update_layout(title=go.layout.Title(text="update_layout() Syntax Example",
                                        font=go.layout.title.Font(size=30)))

Updating Traces

  • Graph object figures support an update_traces() method that may be used to update multiple nested properties of one or more of a figure's traces.

  • To show some examples, we will start with a figure that contains bar and scatter traces across two subplots.

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)

fig.add_scatter(y=[4, 2, 3.5], mode="markers",
                marker=dict(size=20, color="LightSeaGreen"),
                name="a", row=1, col=1)

fig.add_bar(y=[2, 1, 3],
            marker=dict(color="MediumPurple"),
            name="b", row=1, col=1)

fig.add_scatter(y=[2, 3.5, 4], mode="markers",
                marker=dict(size=20, color="MediumPurple"),
                name="c", row=1, col=2)

fig.add_bar(y=[1, 3, 2],
            marker=dict(color="LightSeaGreen"),
            name="d", row=1, col=2)

fig.show()

  • Note that both scatter and bar traces have a marker.color property to control their coloring. Here is an example of using update_traces() to modify the color of all traces.

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)

fig.add_scatter(y=[4, 2, 3.5], mode="markers",
                marker=dict(size=20, color="LightSeaGreen"),
                name="a", row=1, col=1)

fig.add_bar(y=[2, 1, 3],
            marker=dict(color="MediumPurple"),
            name="b", row=1, col=1)

fig.add_scatter(y=[2, 3.5, 4], mode="markers",
                marker=dict(size=20, color="MediumPurple"),
                name="c", row=1, col=2)

fig.add_bar(y=[1, 3, 2],
            marker=dict(color="LightSeaGreen"),
            name="d", row=1, col=2)

#update the defined traces above with update_traces method
fig.update_traces(marker=dict(color="RoyalBlue"))

fig.show()

  • The update_traces() method supports a selector argument to control which traces should be updated. Only traces with properties that match the selector will be updated. Here is an example of using a selector to only update the color of the bar traces.


from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)

fig.add_scatter(y=[4, 2, 3.5], mode="markers",
                marker=dict(size=20, color="LightSeaGreen"),
                name="a", row=1, col=1)

fig.add_bar(y=[2, 1, 3],
            marker=dict(color="MediumPurple"),
            name="b", row=1, col=1)

fig.add_scatter(y=[2, 3.5, 4], mode="markers",
                marker=dict(size=20, color="MediumPurple"),
                name="c", row=1, col=2)

fig.add_bar(y=[1, 3, 2],
            marker=dict(color="LightSeaGreen"),
            name="d", row=1, col=2)

#a specific trace can be updated
fig.update_traces(marker=dict(color="RoyalBlue"),
                  selector=dict(type="bar"))

fig.show()

  • Magic underscore notation can be used in the selector to match nested properties. Here is an example of updating the color of all traces that were formerly colored "MediumPurple".

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)

fig.add_scatter(y=[4, 2, 3.5], mode="markers",
                marker=dict(size=20, color="LightSeaGreen"),
                name="a", row=1, col=1)

fig.add_bar(y=[2, 1, 3],
            marker=dict(color="MediumPurple"),
            name="b", row=1, col=1)

fig.add_scatter(y=[2, 3.5, 4], mode="markers",
                marker=dict(size=20, color="MediumPurple"),
                name="c", row=1, col=2)

fig.add_bar(y=[1, 3, 2],
            marker=dict(color="LightSeaGreen"),
            name="d", row=1, col=2)

#update a specific color in the graph
fig.update_traces(marker_color="RoyalBlue",
                  selector=dict(marker_color="MediumPurple"))

fig.show()

  • For figures with subplots, the update_traces() method also supports row and col arguments to control which traces should be updated. Only traces in the specified subplot row and column will be updated. Here is an example of updating the color of all traces in the second subplot column.

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)

fig.add_scatter(y=[4, 2, 3.5], mode="markers",
                marker=dict(size=20, color="LightSeaGreen"),
                name="a", row=1, col=1)

fig.add_bar(y=[2, 1, 3],
            marker=dict(color="MediumPurple"),
            name="b", row=1, col=1)

fig.add_scatter(y=[2, 3.5, 4], mode="markers",
                marker=dict(size=20, color="MediumPurple"),
                name="c", row=1, col=2)

fig.add_bar(y=[1, 3, 2],
            marker=dict(color="LightSeaGreen"),
            name="d", row=1, col=2)

#update a specific column or row
fig.update_traces(marker=dict(color="RoyalBlue"),
                  col=2)

fig.show()

  • The update_traces() method can also be used on figures produced by figure factories or Plotly Express. Here's an example of updating the regression lines produced by Plotly Express to be dotted.

import pandas as pd
import plotly.express as px

df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 facet_col="species", trendline="ols", title="Using update_traces() With Plotly Express Figures")
fig.show()
import pandas as pd
import plotly.express as px

df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 facet_col="species", trendline="ols", title="Using update_traces() With Plotly Express Figures")

fig.update_traces(
    line=dict(dash="dot", width=4),
    selector=dict(type="scatter", mode="lines"))

fig.show()

Updating Figure Axes

  • Graph object figures support update_xaxes() and update_yaxes() methods that may be used to update multiple nested properties of one or more of a figure's axes. Here is an example of using update_xaxes() to disable the vertical grid lines across all subplots in a figure produced by Plotly Express.

import pandas as pd
import plotly.express as px

df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 facet_col="species", title="Using update_xaxes() With A Plotly Express Figure")

fig.update_xaxes(showgrid=False)

fig.show()

Chaining Figure Operations

  • All of the figure update operations described above are methods that return a reference to the figure being modified. This makes it possible to chain multiple figure modification operations together into a single expression.

  • Here is an example of a chained expression that:

    • creates a faceted scatter plot with OLS trend lines using Plotly Express,

    • sets the title font size using update_layout(),

    • disables vertical grid lines using update_xaxes(),

    • updates the width and dash pattern of the trend lines using update_traces(),

    • and then displays the figure using show().

import plotly.express as px

df = px.data.iris()

(px.scatter(df, x="sepal_width", y="sepal_length", color="species",
            facet_col="species", trendline="ols",
            title="Chaining Multiple Figure Operations With A Plotly Express Figure")
 .update_layout(title_font_size=24)
 .update_xaxes(showgrid=False)
 .update_traces(
     line=dict(dash="dot", width=4),
     selector=dict(type="scatter", mode="lines"))
).show()

Property Assignment

  • Trace and layout properties can be updated using property assignment syntax. Here is an example of setting the figure title using property assignment.

import plotly.graph_objects as go
fig = go.Figure(data=go.Bar(x=[1, 2, 3], y=[1, 3, 2]))
fig.layout.title.text = "Using Property Assignment Syntax With A Graph Object Figure"
fig.show()

  • And here is an example of updating the bar outline using property assignment.

import plotly.graph_objects as go

fig = go.Figure(data=go.Bar(x=[1, 2, 3], y=[1, 3, 2]))

fig.data[0].marker.line.width = 4
fig.data[0].marker.line.color = "black"

fig.show()

Themes

import plotly.io as pio

pio.templates # in a notebook, displays the available templates and the default one

import plotly.express as px

df = px.data.gapminder()
df_2007 = df.query("year==2007")

for template in ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "none"]:
    fig = px.scatter(df_2007,
                     x="gdpPercap", y="lifeExp", size="pop", color="continent",
                     log_x=True, size_max=60,
                     template=template, title="Gapminder 2007: '%s' theme" % template)
    fig.show()
import plotly.graph_objects as go
import pandas as pd

z_data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/api_docs/mt_bruno_elevation.csv")

fig = go.Figure(
    data=go.Surface(z=z_data.values),
    layout=go.Layout(
        title="Mt Bruno Elevation",
        width=500,
        height=500,
    ))

for template in ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "none"]:
    fig.update_layout(template=template, title="Mt Bruno Elevation: '%s' theme" % template)
    fig.show()
import plotly.graph_objects as go
import pandas as pd

z_data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/api_docs/mt_bruno_elevation.csv")

fig = go.Figure(
    data=go.Surface(z=z_data.values),
    layout=go.Layout(
        title="Mt Bruno Elevation",
        width=500,
        height=500,
        template='plotly_dark'
    ))
fig.show()

Scatter

  • With px.scatter, each data point is represented as a marker point, whose location is given by the x and y columns.

# x and y given as DataFrame columns
import plotly.express as px
df = px.data.iris() # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length")
fig.show()

Set size and color with column names

  • Note that color and size data are added to hover information. You can add other columns to hover data with the hover_data argument of px.scatter.

import plotly.express as px
df = px.data.iris()

#define additional dimensions by color, size and hover_data
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 size='petal_length', hover_data=['petal_width'])
fig.show()

Bubble Charts

  • A bubble chart is a scatter plot in which a third dimension of the data is shown through the size of markers.

import plotly.express as px
df = px.data.gapminder()

#define additional information with hover_name
fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp",
                 size="pop", color="continent",
                 hover_name="country", log_x=True, size_max=60)
fig.show()

Line Plot

  • With px.line, each data point is represented as a vertex (whose location is given by the x and y columns) of a polyline mark in 2D space.

import plotly.express as px

df = px.data.gapminder().query("country=='Canada'")
fig = px.line(df, x="year", y="lifeExp", title='Life expectancy in Canada')
fig.show()

data = px.data.gapminder()
data.head() # preview the first rows of the dataset

import plotly.express as px

df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df, x="year", y="lifeExp", color='country')
fig.show()
import plotly.express as px

df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df, x="year", y="lifeExp", color='country')
fig.data[1].update(mode='markers+lines') # show markers as well as lines on the second trace
fig.show()
import plotly.express as px

df = px.data.gapminder().query("continent != 'Asia'") # remove Asia for visibility
fig = px.line(df, x="year", y="lifeExp", color="continent",
              line_group="country", hover_name="country", hover_data=['pop']) #define group of lines
fig.show()

Bar Chart

  • With px.bar, each row of the DataFrame is represented as a rectangular mark.

import plotly.express as px
data_canada = px.data.gapminder().query("country == 'Canada'")
fig = px.bar(data_canada, x='year', y='pop')
fig.show()
import plotly.express as px

long_df = px.data.medals_long()

#show counts of different categories in one bar
fig = px.bar(long_df, x="nation", y="count", color="medal", title="Long-Form Input")
fig.show()
import plotly.express as px

wide_df = px.data.medals_wide()

fig = px.bar(wide_df, x="nation", y=["gold", "silver", "bronze"], title="Wide-Form Input")
fig.show()
import plotly.express as px
data = px.data.gapminder()

data_canada = data[data.country == 'Canada']
fig = px.bar(data_canada, x='year', y='pop',
             hover_data=['lifeExp', 'gdpPercap'], color='lifeExp',
             labels={'pop':'population of Canada'}, height=400) #update labels 
fig.show()

  • When several rows share the same value of x (here Female or Male), the rectangles are stacked on top of one another by default.

import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()

# Total bill per value of sex, i.e. the overall height of each stacked bar
df.groupby('sex')['total_bill'].sum()

Faceted subplots

  • Use the keyword arguments facet_row (resp. facet_col) to create faceted subplots, where different rows (resp. columns) correspond to different values of the dataframe column specified in facet_row (resp. facet_col).

import plotly.express as px
df = px.data.tips()

#facet-col as days, facet_row as time
fig = px.bar(df, x="sex", y="total_bill", color="smoker", barmode="group",
             facet_row="time", facet_col="day",
             category_orders={"day": ["Thur", "Fri", "Sat", "Sun"],
                              "time": ["Lunch", "Dinner"]})
fig.show()

Pie Chart

  • A pie chart is a circular statistical chart, which is divided into sectors to illustrate numerical proportion.

import plotly.express as px

# This dataframe has 244 lines, but 4 distinct values for `day`
df = px.data.tips()
fig = px.pie(df, values='tip', names='day')
fig.show()
import plotly.express as px

df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries

#add title
fig = px.pie(df, values='pop', names='country', title='Population of European continent')
fig.show()
import plotly.express as px

df = px.data.gapminder().query("year == 2007").query("continent == 'Americas'")

#add hover_data, update labels
fig = px.pie(df, values='pop', names='country',
             title='Population of American continent',
             hover_data=['lifeExp'], labels={'lifeExp':'life expectancy'})

#update text position and text
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

Donut Chart

import plotly.express as px

df = px.data.tips()

fig = px.pie(df, values='tip', names='day', hole=.3)
fig.show()

Histogram

  • In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is represented.

  • More generally, in plotly a histogram is an aggregated bar chart, with several possible aggregation functions (e.g. sum, average, count...). Also, the data to be binned can be numerical data but also categorical or date data.

import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill")
fig.show()
import plotly.express as px
df = px.data.tips()

# Here we use a column with categorical data
fig = px.histogram(df, x="day")
fig.show()
# Choosing the number of bins
import plotly.express as px
df = px.data.tips()

#adjust the number of bins
fig = px.histogram(df, x="total_bill", nbins=10)
fig.show()

Accessing the counts (y-axis) values

  • JavaScript calculates the y-axis (count) values on the fly in the browser, so they are not accessible in the fig. You can calculate them manually using np.histogram.

import plotly.express as px
import numpy as np

df = px.data.tips()

# count the values in 5-unit-wide bins with edges from 0 to 55
counts, bins = np.histogram(df.total_bill, bins=range(0, 60, 5))
# use the bin centers as the bar positions
bins = 0.5 * (bins[:-1] + bins[1:])

fig = px.bar(x=bins, y=counts, labels={'x':'total_bill', 'y':'count'})
fig.show()

Type of normalization

  • The default mode is to represent the count of samples in each bin.

  • With the histnorm argument, it is also possible to represent the percentage or fraction of samples in each bin (histnorm='percent' or 'probability'), a density histogram in which the sum of all bar areas equals the total number of sample points (histnorm='density'), or a probability density histogram in which the sum of all bar areas equals 1 (histnorm='probability density').

import plotly.express as px

df = px.data.tips()

fig = px.histogram(df, x="total_bill", histnorm='probability density')
fig.show()
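
The relationship between these normalization modes is plain arithmetic on the per-bin counts. The sketch below uses only the standard library and a made-up sample with fixed-width bins. It illustrates the definitions above and is not Plotly's own implementation, which bins the data in the browser.

```python
# Hypothetical sample and bin layout, chosen only to keep the numbers easy to follow
values = [1, 2, 2, 3, 3, 3, 6, 7, 8, 9]
bin_width = 5
bin_starts = [0, 5]  # bins are [0, 5) and [5, 10)

# Default mode: number of samples in each bin
counts = [sum(start <= v < start + bin_width for v in values) for start in bin_starts]
total = sum(counts)

normalized = {
    "count": counts,
    "percent": [100 * c / total for c in counts],
    "probability": [c / total for c in counts],
    # density: bar area (height * bin width) equals the count in that bin
    "density": [c / bin_width for c in counts],
    # probability density: bar areas sum to 1
    "probability density": [c / (total * bin_width) for c in counts],
}

for mode, heights in normalized.items():
    area = sum(h * bin_width for h in heights)
    print(f"{mode:>20}: heights={heights}, total bar area={area}")
```
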
import plotly.express as px

df = px.data.tips()

fig = px.histogram(df, x="total_bill",
                   title='Histogram of bills',
                   labels={'total_bill':'total bill'}, # can specify one label per df column
                   opacity=0.8,
                   log_y=True, # represent bars with log scale
                   color_discrete_sequence=['indianred'] # color of histogram bars
                   )
fig.show()

Several histograms for the different values of one column

import plotly.express as px

df = px.data.tips()

fig = px.histogram(df, x="total_bill", color="sex")
fig.show()

Using histfunc

  • For each bin of x, one can compute a function of data using histfunc. The argument of histfunc is the dataframe column given as the y argument. The plot below shows that the average tip increases with the total bill.

import plotly.express as px

df = px.data.tips()

fig = px.histogram(df, x="total_bill", y="tip", histfunc='avg')
fig.show()
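
Conceptually, histfunc='avg' groups the rows by the bin their x value falls into and averages the y values within each bin. The following is a standard-library sketch with made-up bills and tips and an assumed bin width of 10. Plotly chooses its own bins, so the bars in the real figure may differ.

```python
from collections import defaultdict

# Hypothetical (total_bill, tip) pairs, for illustration only
rows = [(8, 1.0), (12, 2.0), (15, 3.0), (22, 3.0), (27, 5.0), (35, 6.0)]
bin_width = 10  # assumed bin width

# Collect the tips that fall into each total_bill bin
tips_per_bin = defaultdict(list)
for total_bill, tip in rows:
    bin_start = (total_bill // bin_width) * bin_width
    tips_per_bin[bin_start].append(tip)

# Average of y within each bin: the quantity an 'avg' histogram shows as bar height
avg_tip_per_bin = {
    bin_start: sum(tips) / len(tips)
    for bin_start, tips in sorted(tips_per_bin.items())
}

for bin_start, avg_tip in avg_tip_per_bin.items():
    print(f"[{bin_start}, {bin_start + bin_width}): average tip = {avg_tip}")
```
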

Visualizing the distribution

  • With the marginal keyword, a subplot is drawn alongside the histogram, visualizing the distribution.

import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill", color="sex", marginal="violin", # can be `box`, `violin`, 'rug'
                         hover_data=df.columns)
fig.show()

Box Plot

  • A box plot is a statistical representation of numerical data through their quartiles. The ends of the box represent the lower and upper quartiles, while the median (second quartile) is marked by a line inside the box.

  • In a box plot created by px.box, the distribution of the column given as y argument is represented.

import plotly.express as px

df = px.data.tips()

fig = px.box(df, y="total_bill")
fig.show()

  • If a column name is given as x argument, a box plot is drawn for each value of x.

import plotly.express as px

df = px.data.tips()

fig = px.box(df, x="time", y="total_bill")
fig.show()
import plotly.express as px

df = px.data.tips()

fig = px.box(df, x="time", y="total_bill", points="all") # show all points; points="outliers" is the default
fig.show()

Choosing The Algorithm For Computing Quartiles

  • By default, quartiles for box plots are computed using the linear method. For more about linear interpolation, see #10 listed on http://www.amstat.org/publications/jse/v14n3/langford.html; for more details on quartiles, see https://en.wikipedia.org/wiki/Quartile.

  • However, you can also choose to use an exclusive or an inclusive algorithm to compute quartiles.

  • The exclusive algorithm uses the median to divide the ordered dataset into two halves. If the sample is odd, it does not include the median in either half. Q1 is then the median of the lower half and Q3 is the median of the upper half.

  • The inclusive algorithm also uses the median to divide the ordered dataset into two halves, but if the sample is odd, it includes the median in both halves. Q1 is then the median of the lower half and Q3 the median of the upper half.
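
The difference between the two algorithms is easiest to see on a small sample with an odd number of values. The sketch below implements the exclusive and inclusive rules as described above, using only the standard library. The function name quartiles and the sample data are illustrative, and this is not the code that Plotly.js runs.

```python
from statistics import median


def quartiles(values, method):
    """Return (Q1, median, Q3) using the 'exclusive' or 'inclusive' rule.

    Expects at least two values so that neither half is empty.
    """
    if method not in ("exclusive", "inclusive"):
        raise ValueError("method must be 'exclusive' or 'inclusive'")
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even sample size: the two halves are the same for both rules
        lower, upper = data[:half], data[half:]
    elif method == "exclusive":
        # Odd sample size: leave the median out of both halves
        lower, upper = data[:half], data[half + 1:]
    else:
        # Odd sample size: include the median in both halves
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data with an odd number of values
exclusive_result = quartiles(sample, "exclusive")
inclusive_result = quartiles(sample, "inclusive")
print("exclusive:", exclusive_result)
print("inclusive:", inclusive_result)
```

For this sample the exclusive rule gives a wider box (Q1 = 2.5, Q3 = 7.5) than the inclusive rule (Q1 = 3, Q3 = 7), because leaving the median out pushes both halves away from the center.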

import plotly.express as px

df = px.data.tips()

fig = px.box(df, x="day", y="total_bill", color="smoker")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
import plotly.express as px

df = px.data.tips()

fig = px.box(df, x="day", y="total_bill", color="smoker")
fig.update_traces(quartilemethod="inclusive") # or "exclusive", or "linear" by default
fig.show()

Violin Plot

  • A violin plot is a statistical representation of numerical data.

  • It is similar to a box plot, with the addition of a rotated kernel density plot on each side.

import plotly.express as px

df = px.data.tips()
fig = px.violin(df, y="total_bill")
fig.show()
import plotly.express as px

df = px.data.tips()
fig = px.violin(df, y="total_bill", box=True, # draw box plot inside the violin
                points='all', # can be 'outliers', or False
               )
fig.show()
import plotly.express as px

df = px.data.tips()

fig = px.violin(df, y="tip", x="smoker", color="sex", box=True, points="all",
          hover_data=df.columns)
fig.show()
import plotly.express as px

df = px.data.tips()

fig = px.violin(df, y="tip", color="sex",
                violinmode='overlay', # draw violins on top of each other
                # default violinmode is 'group' as in example above
                hover_data=df.columns)
fig.show()

Scatterplot Matrix

  • A scatterplot matrix is a matrix associated to n numerical arrays (data variables), X1,X2,…,Xn , of the same length. The cell (i,j) of such a matrix displays the scatter plot of the variable Xi versus Xj.

  • Here we show the Plotly Express function px.scatter_matrix to plot the scatter matrix for the columns of the dataframe. By default, all columns are considered.

import plotly.express as px

df = px.data.iris()

fig = px.scatter_matrix(df)
fig.show()

  • Specify the columns to be represented with the dimensions argument, and set colors using a column of the dataframe:

import plotly.express as px

df = px.data.iris()

fig = px.scatter_matrix(df,
    dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
    color="species")
fig.show()

Styled Scatter Matrix with Plotly Express

  • The scatter matrix plot can be configured thanks to the parameters of px.scatter_matrix, but also thanks to fig.update_traces for fine tuning (see the next section to learn more about the options).

import plotly.express as px

df = px.data.iris()

fig = px.scatter_matrix(df,
    dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
    color="species", symbol="species",
    title="Scatter matrix of iris data set",
    labels={col:col.replace('_', ' ') for col in df.columns}) # remove underscore

fig.update_traces(diagonal_visible=False)
fig.show()

3D Scatter Plots

  • Like the 2D scatter plot px.scatter, the 3D function px.scatter_3d plots individual data points in three-dimensional space.

import plotly.express as px

df = px.data.iris()

fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
              color='species')
fig.show()

  • A fourth dimension of the data can be represented by the color of the markers. Below, petal_length sets the color and values from the species column assign symbols to the markers.

import plotly.express as px

df = px.data.iris()

fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                    color='petal_length', symbol='species')
fig.show()

Style 3d scatter plot

  • It is possible to customize the style of the figure through the parameters of px.scatter_3d for some options, or by updating the traces or the layout of the figure through fig.update.

import plotly.express as px

df = px.data.iris()

fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
              color='petal_length', size='petal_length', size_max=18,
              symbol='species', opacity=0.7)

# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

Inset Plot

  • An inset plot is a layer which is added to an existing layer in a graph window.

import plotly.graph_objects as go

trace1 = go.Scatter(
    x=[1, 2, 3],
    y=[4, 3, 2],
    name='graph 1'
)

trace2 = go.Scatter(
    x=[20, 30, 40],
    y=[30, 40, 50],
    xaxis='x2',
    yaxis='y2',
    name='graph 2'
)

data = [trace1, trace2]

layout = go.Layout(
    xaxis2=dict(
        domain=[0.6, 0.95],
        anchor='y2'
    ),
    yaxis2=dict(
        domain=[0.6, 0.95],
        anchor='x2'
    )
)

fig = go.Figure(data=data, layout=layout)
fig.show()

Radar (Spider) Chart

  • A Radar Chart (also known as a spider plot or star plot) displays multivariate data in the form of a two-dimensional chart of quantitative variables represented on axes originating from the center. The relative position and angle of the axes are typically uninformative. It is equivalent to a parallel coordinates plot with the axes arranged radially.

import plotly.express as px
import pandas as pd
df = pd.DataFrame(dict(
    r=[1, 5, 2, 2, 3],
    theta=['processing cost','mechanical properties','chemical stability',
           'thermal stability', 'device integration']))
fig = px.line_polar(df, r='r', theta='theta', line_close=True)
fig.show()

import plotly.express as px
import pandas as pd

df = pd.DataFrame(dict(
    r=[1, 5, 2, 2, 3],
    theta=['processing cost','mechanical properties','chemical stability',
           'thermal stability', 'device integration']))

#fill inside the graph
fig = px.line_polar(df, r='r', theta='theta', line_close=True)
fig.update_traces(fill='toself')
fig.show()

Map Plots

Mapbox Maps vs Geo Maps

  • Plotly supports two different kinds of maps:

    • Mapbox maps are tile-based maps. If your figure is created with a px.scatter_mapbox, px.line_mapbox, px.choropleth_mapbox or px.density_mapbox function or otherwise contains one or more traces of type go.Scattermapbox, go.Choroplethmapbox or go.Densitymapbox, the layout.mapbox object in your figure contains configuration information for the map itself.

    • Geo maps are outline-based maps. If your figure is created with a px.scatter_geo, px.line_geo or px.choropleth function or otherwise contains one or more traces of type go.Scattergeo or go.Choropleth, the layout.geo object in your figure contains configuration information for the map itself.

How Layers Work in Mapbox Tile Maps

  • Mapbox tile maps are composed of various layers, of three different types:

    • layout.mapbox.style defines the lowest layers, also known as your "base map"

    • The various traces in data are by default rendered above the base map (although this can be controlled via the below attribute).

    • layout.mapbox.layers is an array that defines more layers that are by default rendered above the traces in data (although this can also be controlled via the below attribute).

Mapbox Access Tokens and When You Need Them

  • The word "mapbox" in the trace names and layout.mapbox refers to the Mapbox GL JS open-source library, which is integrated into Plotly.py.

  • If your basemap in layout.mapbox.style uses data from the Mapbox service, then you will need to register for a free account at https://mapbox.com/ and obtain a Mapbox Access token. This token should be provided in layout.mapbox.access_token (or, if using Plotly Express, via the px.set_mapbox_access_token() configuration function).

  • If your layout.mapbox.style does not use data from the Mapbox service, you do not need to register for a Mapbox account.

Base Maps in layout.mapbox.style

  • The accepted values for layout.mapbox.style are one of:

    • "white-bg" yields an empty white canvas which results in no external HTTP requests

    • "open-street-map", "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner" or "stamen-watercolor" yield maps composed of raster tiles from various public tile servers which do not require signups or access tokens

    • "basic", "streets", "outdoors", "light", "dark", "satellite", or "satellite-streets" yield maps composed of vector tiles from the Mapbox service, and do require a Mapbox Access Token or an on-premise Mapbox installation.

    • A Mapbox service style URL, which requires a Mapbox Access Token or an on-premise Mapbox installation.

    • A Mapbox Style object as defined at https://docs.mapbox.com/mapbox-gl-js/style-spec/

Scatter Map

Here is a simple map rendered with OpenStreetMap tiles, without needing a Mapbox Access Token:

import pandas as pd
us_cities = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/us-cities-top-1k.csv")

import plotly.express as px

fig = px.scatter_mapbox(us_cities, lat="lat", lon="lon", hover_name="City", hover_data=["State", "Population"],
                        color_discrete_sequence=["fuchsia"], zoom=3, height=300)

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

  • px.scatter_mapbox can work well with GeoPandas dataframes whose geometry is of type Point.

Choropleth Map

  • A Choropleth Map is a map composed of colored polygons. It is used to represent spatial variations of a quantity.

Main Parameters for Choropleth Tile Maps

  • Making choropleth Mapbox maps requires two main types of input:

    • GeoJSON-formatted geometry information where each feature has either an id field or some identifying value in properties.

    • A list of values indexed by feature identifier.

    • The GeoJSON data is passed to the geojson argument, and the data is passed into the color argument of px.choropleth_mapbox (z if using graph_objects), in the same order as the IDs are passed into the locations argument.

    • Note the geojson attribute can also be the URL to a GeoJSON file, which can speed up map rendering in certain cases.

GeoJSON with feature.id

  • Here we load a GeoJSON file containing the geometry information for US counties, where feature.id is a FIPS code.

from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

counties["features"][0]

Data indexed by id

  • Here we load unemployment data by county, also indexed by FIPS code.

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                   dtype={"fips": str})
df.head()

Choropleth map using plotly.express and carto base map (no token needed)

  • With px.choropleth_mapbox, each row of the DataFrame is represented as a region of the choropleth.

from urllib.request import urlopen
import json

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                   dtype={"fips": str})

import plotly.express as px

fig = px.choropleth_mapbox(df, geojson=counties, locations='fips', color='unemp',
                           color_continuous_scale="Viridis",
                           range_color=(0, 12),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 37.0902, "lon": -95.7129},
                           opacity=0.5,
                           labels={'unemp':'unemployment rate'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Indexing by GeoJSON Properties

  • If the GeoJSON you are using either does not have an id field or you wish to use one of the keys in the properties field, you may use the featureidkey parameter to specify where to match the values of locations.

  • In the following GeoJSON object/data-file pairing, the values of properties.district match the values of the district column:

import plotly.express as px

df = px.data.election()
geojson = px.data.election_geojson()

print(df["district"][2])
print(geojson["features"][0]["properties"])
df.head()

  • To use them together, we set locations to district and featureidkey to "properties.district". The color is set to the number of votes for the candidate named Bergeron.

import plotly.express as px

df = px.data.election()
geojson = px.data.election_geojson()

fig = px.choropleth_mapbox(df, geojson=geojson, color="Bergeron",
                           locations="district", featureidkey="properties.district",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="carto-positron", zoom=9)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
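
  • When regions fail to appear on a choropleth, a common cause is that some values of locations have no matching feature. The plain-Python sketch below checks this by hand. feature_ids and unmatched are hypothetical helpers, not part of Plotly, and the dotted-path lookup is an assumption about how a key such as "properties.district" is resolved.

```python
def feature_ids(geojson, featureidkey="id"):
    """Collect the identifier of every feature, following a dotted key path."""
    ids = set()
    for feature in geojson["features"]:
        value = feature
        for key in featureidkey.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        ids.add(value)
    return ids


def unmatched(locations, geojson, featureidkey="id"):
    """Return the locations that have no feature with a matching identifier."""
    known = feature_ids(geojson, featureidkey)
    return [loc for loc in locations if loc not in known]


# Hypothetical miniature GeoJSON with no id field, only properties.district.
geojson = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"district": "North"}, "geometry": None},
    {"type": "Feature", "properties": {"district": "South"}, "geometry": None},
]}

print(unmatched(["North", "South", "East"], geojson, "properties.district"))
# ['East']
```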

Discrete Colors

  • In addition to continuous colors, we can discretely-color our choropleth maps by setting color to a non-numerical column, like the name of the winner of an election.

import plotly.express as px

df = px.data.election()
geojson = px.data.election_geojson()

fig = px.choropleth_mapbox(df, geojson=geojson, color="winner",
                           locations="district", featureidkey="properties.district",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Bubble Maps

  • With px.scatter_geo, each row of the dataframe is represented as a marker point. The column set as the size argument gives the size of markers.

import plotly.express as px

df = px.data.gapminder().query("year==2007")

fig = px.scatter_geo(df, locations="iso_alpha", color="continent",
                     hover_name="country", size="pop",
                     projection="natural earth")
fig.show()

Bubble Map with animation

import plotly.express as px

df = px.data.gapminder()

fig = px.scatter_geo(df, locations="iso_alpha", color="continent",
                     hover_name="country", size="pop",
                     animation_frame="year",
                     projection="natural earth")
fig.show()

Animated Figures

  • Several Plotly Express functions support the creation of animated figures through the animation_frame and animation_group arguments.

  • Here is an example of an animated scatter plot created using Plotly Express. Note that you should always fix range_x and range_y to ensure that your data remains visible throughout the animation.

import plotly.express as px

df = px.data.gapminder()

fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
                 size="pop", color="continent", hover_name="country",
                 log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90])
fig.show()

import plotly.express as px

df = px.data.gapminder()

fig = px.bar(df, x="continent", y="pop", color="continent",
  animation_frame="year", animation_group="country", range_y=[0,4000000000])
fig.show()

Plotly tutorials and Examples with Plotly Express:

Plotly tutorials and examples with Figures as Graph Objects:

Rare Visualisations

Missingno

Definition:

  • missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset.

Resources:

Installation:

  • pip install missingno

Matrix

  • The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

  • The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.

  • This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.

# Program to visualize missing values in dataset

# Importing the libraries
import pandas as pd
import missingno as msno

# Loading the dataset (source: https://openmv.net/info/kamyr-digester)
df = pd.read_csv("kamyr-digester.csv")

# Visualize missing values as a matrix
msno.matrix(df);

Bar Chart

  • msno.bar is a simple visualization of nullity by column.

  • This bar chart gives you an idea of how many missing values there are in each column.

  • You can switch to a logarithmic scale by specifying log=True. bar provides the same information as matrix, but in a simpler format.

# Visualize the number of missing 
# values as a bar chart 
msno.bar(df);

Heatmap

  • The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

# Visualize the correlation between the number of 
# missing values in different columns as a heatmap 
msno.heatmap(df);
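
  • To make "nullity correlation" concrete, here is a minimal pandas-only sketch on a made-up table. It assumes the quantity of interest is the ordinary correlation between the missing/present indicators of each pair of columns; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up data: a and b are missing in the same rows,
# c is missing exactly where a is present.
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 1.0, np.nan, 1.0, 1.0],
    "c": [np.nan, 2.0, np.nan, 2.0, np.nan, np.nan],
})

# 1 where a value is missing, 0 where it is present.
nullity = sample.isnull().astype(int)

# Correlation between the missingness indicators of each pair of columns.
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

  • A value of 1 means two columns are always missing together, and a value of -1 means one is missing exactly when the other is present.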

Dendrogram

  • The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.

  • The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

  • To interpret this graph, read it from a top-down perspective. Cluster leaves that are linked together at a distance of zero fully predict one another's presence: one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.

  • Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually do or ought to match each other in nullity, then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.

  • As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However, the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.

# Visualize the dendrogram
msno.dendrogram(df);
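
  • The clustering idea can be sketched directly with scipy on a made-up table. Average linkage and Euclidean distance are choices made for this illustration only and may differ from what msno.dendrogram uses; the point is that columns with identical nullity patterns merge at a distance of zero.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Made-up data: a and b share the same missingness pattern, c is the opposite.
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 1.0, np.nan, 1.0, 1.0],
    "c": [np.nan, 2.0, np.nan, 2.0, np.nan, np.nan],
})

# One row per column of the table: variables are clustered, not records.
x = sample.isnull().astype(float).T.values

# Each row of Z describes one merge: [cluster_i, cluster_j, distance, size].
Z = hierarchy.linkage(x, method="average")

first_merge = Z[0]
merged = [sample.columns[int(first_merge[0])], sample.columns[int(first_merge[1])]]
print(merged, "merge at distance", first_merge[2])
```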

Word Cloud

# pip install wordcloud
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

timesData = pd.read_csv("timesData.csv")
x2011 = timesData.country[timesData.year == 2011]

plt.subplots(figsize=(8,8))
wordcloud = WordCloud(
                          background_color='white',
                          width=512,
                          height=384
                         ).generate(" ".join(x2011))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('graph.png')

plt.show()

Kaggle tutorial for Rare Visualisation tools

6MB
Data_visualisation_documents.rar
