Week 19
Advanced Pandas DataFrames
In this module we will learn more about Pandas DataFrames. Specifically, we will learn how to pivot and reshape our DataFrames so we can better prepare them for the task we want them to perform. Before we dive into pivoting, we need to know a little bit about MultiIndexes.
MultiIndexing
A MultiIndex, or hierarchical index, allows you to have multiple levels of indexing on an axis of a pandas DataFrame. Here is an example of creating a DataFrame with a MultiIndex:
>>> # Create a list of tuples representing the data
>>> data = [
... ('A', 'a', 1, 'dog'),
... ('A', 'b', 2, 'dog'),
... ('B', 'a', 3, 'cat'),
... ('B', 'b', 4, 'cat')]
>>> # Create a MultiIndex by passing the tuples to the DataFrame constructor
>>> df = pd.DataFrame(data, columns=['level_1', 'level_2', 'value', 'category'])
>>> # set the multiIndex
>>> df = df.set_index(['level_1', 'level_2'])
>>> df
                 value category
level_1 level_2
A       a            1      dog
        b            2      dog
B       a            3      cat
        b            4      cat
>>> df.index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['level_1', 'level_2'])
As you can see, our index now has two so-called "levels". It is represented as a list of tuples, but a MultiIndex is hierarchical in the sense that the first level is the main identifier, then the second level, and so on. We can index into a MultiIndex as follows:
>>> df.loc['A']
        value category
level_2
a           1      dog
b           2      dog
>>> df.loc[('A', 'a')]
value 1
category dog
Name: (A, a), dtype: object
This is also where the hierarchical nature of a MultiIndex comes in: we cannot simply access the second level using df.loc["a"], because .loc resorts to the first level by default. To access specific indices on different levels we can use xs():
>>> df.xs('a', level=1)
value category
level_1
A 1 dog
B 3 cat
Here you need to specify the level that you are indexing. Levels, like everything in Python, are zero-indexed, which is why we specify level=1.
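Besides xs(), .loc itself can select on inner levels using pd.IndexSlice; unlike xs(), this keeps the selected level in the result. A minimal sketch on the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4],
                   'category': ['dog', 'dog', 'cat', 'cat']},
                  index=pd.MultiIndex.from_tuples(
                      [('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')],
                      names=['level_1', 'level_2']))

# Take every first-level label, but only 'a' on the second level.
idx = pd.IndexSlice
subset = df.loc[idx[:, 'a'], :]
print(subset)
```

Whether you want the selected level dropped (xs) or kept (IndexSlice) depends on what you plan to do with the result next.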
Multi-level indexing also allows you to perform operations on subsets of the data using the .groupby() method, and it can be used to perform more complex data reshaping and aggregations. You can call groupby() using an index level as follows:
>>> df.groupby(level=1).mean(numeric_only=True)
value
level_2
a 2.0
b 3.0
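As an aside, the MultiIndex we built earlier via set_index can also be constructed directly, without the detour through ordinary columns. A sketch of the same toy frame, assuming pd.MultiIndex.from_product (which takes the cartesian product of the per-level values):

```python
import pandas as pd

# Build the two-level index up front instead of calling set_index later.
index = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']],
                                   names=['level_1', 'level_2'])
df = pd.DataFrame({'value': [1, 2, 3, 4],
                   'category': ['dog', 'dog', 'cat', 'cat']},
                  index=index)
print(df)
```

from_product is convenient when every combination of level values actually occurs; pd.MultiIndex.from_tuples is the more general option.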
You can apply the exact same hierarchical MultiIndexing to the column axis as well, not just to the rows. This allows for similar complex data reshaping, although groupby() operates on the row index by default:
>>> data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>> cols = pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'c')])
>>> df = pd.DataFrame(data, columns=cols)
>>> df
   A     B
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
>>> df.loc[:, ('A')]
a b
0 1 2
1 4 5
2 7 8
>>> df.xs('a', level=1, axis=1)
A
a
0 1
1 4
2 7
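Two more column-axis tricks worth knowing: a single column is addressed with a tuple covering both levels, and swaplevel() reorders the hierarchy. A sketch using the same frame as above:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'c')])
df = pd.DataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)], columns=cols)

# Address one column with a full tuple of labels.
col = df[('A', 'b')]

# Make the inner column level the outer one (sort for a tidy display).
swapped = df.swaplevel(axis=1).sort_index(axis=1)
print(swapped)
```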
You can check out this Pandas docs page on MultiIndexes and advanced indexing for more elaborate explanations.
Reshaping and pivot tables
Reshaping by pivoting DataFrame objects
You may be familiar with pivot tables in Excel for generating quick insights into your data. In this module, we'll explore how to create pivot tables in Python using the pivot table functionality available in Pandas. The functions themselves are quite easy to use, but they are not the most intuitive, so this module will give you a complete overview of how to use them!
Pivoting is quite similar to the groupby() method also available in Pandas, but offers significantly more customization, as we'll see later on.

Data is often stored in so-called “stacked” or “record” format:
>>> df
date variable value
0 2000-01-03 A 0.441639
1 2000-01-04 A -0.829340
2 2000-01-05 A 0.534826
3 2000-01-03 B 0.950670
4 2000-01-04 B -0.736797
5 2000-01-05 B -0.712704
6 2000-01-03 C 1.829353
7 2000-01-04 C -1.893993
8 2000-01-05 C 1.198726
9 2000-01-03 D -0.469722
10 2000-01-04 D 0.793962
11 2000-01-05 D -1.632517
For the curious, here is how that DataFrame is created, but you don't have to worry too much about it:
import numpy as np
import pandas as pd
import pandas._testing as tm

def unpivot(frame):
    N, K = frame.shape
    data = {
        "value": frame.to_numpy().ravel("F"),
        "variable": np.asarray(frame.columns).repeat(N),
        "date": np.tile(np.asarray(frame.index), K),
    }
    return pd.DataFrame(data, columns=["date", "variable", "value"])

df = unpivot(tm.makeTimeDataFrame(3))
We have learned in the previous module how to access data of a specific category:
>>> df.loc[df["variable"] == "A"]
date variable value
0 2000-01-03 A 0.441639
1 2000-01-04 A -0.829340
2 2000-01-05 A 0.534826
However, suppose we wish to do some analysis on how all of the variables (A, B, C, D) differ over time. A better representation would be a DataFrame where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, we use the DataFrame.pivot() method (also implemented as a top-level function pivot()):
>>> df.pivot(index="date", columns="variable", values="value")
variable A B C D
date
2000-01-03 0.441639 0.950670 1.829353 -0.469722
2000-01-04 -0.829340 -0.736797 -1.893993 0.793962
2000-01-05 0.534826 -0.712704 1.198726 -1.632517
Now we can index by the date, and easily compute statistics of variables using what we have learned last week (df["A"].mean(), for example).
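It can help intuition to see that pivot() is essentially set_index() followed by unstack() (a method covered below). A small sketch with made-up numbers, not the random data above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2000-01-03', '2000-01-04', '2000-01-03', '2000-01-04'],
    'variable': ['A', 'A', 'B', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

wide = df.pivot(index='date', columns='variable', values='value')

# The same reshape, spelled out with MultiIndex tools:
same = df.set_index(['date', 'variable'])['value'].unstack()
print(wide.equals(same))
```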
If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column:
>>> df["value2"] = df["value"] * 2
>>> pivoted = df.pivot(index="date", columns="variable")
>>> pivoted
value value2
variable A B C D A B C D
date
2000-01-03  0.441639  0.950670  1.829353 -0.469722  0.883278  1.901340  3.658706 -0.939444
2000-01-04 -0.829340 -0.736797 -1.893993  0.793962 -1.658680 -1.473594 -3.787986  1.587924
2000-01-05  0.534826 -0.712704  1.198726 -1.632517  1.069652 -1.425408  2.397452 -3.265034
You can then select subsets from the pivoted DataFrame:
>>> pivoted["value2"]
variable A B C D
date
2000-01-03  0.883278  1.901340  3.658706 -0.939444
2000-01-04 -1.658680 -1.473594 -3.787986  1.587924
2000-01-05  1.069652 -1.425408  2.397452 -3.265034
Reshaping by stacking and unstacking

Closely related to the pivot() method are the stack() and unstack() methods available on Series and DataFrames. These methods are designed to work together with MultiIndex objects. Here is essentially what these methods do:
stack: "pivot" a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.
unstack: (inverse operation of stack) "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.

The clearest way to explain is by example. Let's take an example data set that looks as follows and has a hierarchical index:
>>> tuples = [
... ('bar', 'one'),
... ('bar', 'two'),
... ('baz', 'one'),
... ('baz', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
>>> df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=["A", "B"])
>>> df
A B
first second
bar one -0.159795 0.158122
two 0.868592 0.135956
baz one -0.098369 0.386648
two 2.332208 -0.274276
The stack function “compresses” a level in the DataFrame’s columns to produce either:
A Series, in the case of a simple column Index.
A DataFrame, in the case of a MultiIndex in the columns.
If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:
>>> stacked = df.stack()
>>> stacked
first second
bar one A -0.159795
B 0.158122
two A 0.868592
B 0.135956
baz one A -0.098369
B 0.386648
two A 2.332208
B -0.274276
dtype: float64
Now you have moved one of the columns into the MultiIndex! With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:
>>> stacked.unstack()
A B
first second
bar one -0.159795 0.158122
two 0.868592 0.135956
baz one -0.098369 0.386648
two 2.332208 -0.274276

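A useful sanity check: because unstack() inverts stack(), a stack/unstack round trip returns the original frame. A sketch with deterministic numbers instead of the random ones above:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')],
    names=['first', 'second'])
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index,
                  columns=['A', 'B'])

stacked = df.stack()           # columns move into the row index
restored = stacked.unstack()   # ...and back out again
print(restored.equals(df))
```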
You can unstack different levels. If the indexes have names, you can use the level names instead of specifying the level numbers:
>>> stacked.unstack(1)
second one two
first
bar A -0.159795 0.868592
B 0.158122 0.135956
baz A -0.098369 2.332208
B 0.386648 -0.274276
>>> stacked.unstack("first")
first bar baz
second
one A -0.159795 -0.098369
B 0.158122 0.386648
two A 0.868592 2.332208
B 0.135956 -0.274276

Missing data
These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sort_index, of course).
Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type: NaN for float, NaT for datetimelike, etc. For integer types, by default data will be converted to float and missing values will be set to NaN.
In general, it is good to check when stacking and unstacking how the data within your newly created index behaves to make sure that you are getting the result you are looking for.
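If NaN is not what you want in those holes, unstack() accepts a fill_value. A minimal sketch with two groups that deliberately do not share labels:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('baz', 'two')], names=['first', 'second'])
s = pd.Series([1, 2], index=index)

# 'bar' has no 'two' and 'baz' has no 'one': unstacking creates holes,
# which fill_value plugs with 0 instead of NaN.
filled = s.unstack(fill_value=0)
print(filled)
```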
Reshaping by melt

The top-level melt() function and the corresponding DataFrame.melt() method are useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are "unpivoted" to the row axis, leaving just two non-identifier columns, "variable" and "value". The names of those columns can be customized by supplying the var_name and value_name parameters. Consider the following data:
>>> data = {
... "first": ["John", "Mary"],
... "last": ["Doe", "Bo"],
... "height": [5.5, 6.0],
... "weight": [130, 150],
... }
>>> df = pd.DataFrame(data)
>>> df
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
We can then do the following to let the height and weight columns function as variables and their values as values:
>>> df.melt(id_vars=["first", "last"])
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
We can rename the new variable column as we please:
>>> df.melt(id_vars=["first", "last"], var_name="quantity")
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
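melt() can also restrict itself to a subset of measured columns via value_vars, and rename the value column via value_name. A sketch on the same data (the names quantity and reading are just choices for this example):

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['John', 'Mary'],
    'last': ['Doe', 'Bo'],
    'height': [5.5, 6.0],
    'weight': [130, 150],
})

# Only melt 'height', and name both generated columns ourselves.
melted = df.melt(id_vars=['first', 'last'], value_vars=['height'],
                 var_name='quantity', value_name='reading')
print(melted)
```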
When transforming a DataFrame using melt(), the index will be ignored. The original index values can be kept by setting the ignore_index parameter to False (the default is True). Note that this can result in duplicate index values.
>>> index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
>>> df = pd.DataFrame(data=data, index=index)
>>> df
first last height weight
person A John Doe 5.5 130
B Mary Bo 6.0 150
If we perform the same operation as before the index will be ignored:
>>> df.melt(id_vars=["first", "last"])
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
Unless told otherwise:
>>> df.melt(id_vars=["first", "last"], ignore_index=False)
first last variable value
person A John Doe height 5.5
B Mary Bo height 6.0
A John Doe weight 130.0
B Mary Bo weight 150.0
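Since melt() and pivot() are opposites, you can round-trip: pivot the molten frame back to wide form. A sketch (passing a list of index columns to pivot assumes pandas 1.1 or newer):

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['John', 'Mary'],
    'last': ['Doe', 'Bo'],
    'height': [5.5, 6.0],
    'weight': [130, 150],
})
long = df.melt(id_vars=['first', 'last'])

# Identifiers back to the index, 'variable' back to columns.
wide = long.pivot(index=['first', 'last'], columns='variable',
                  values='value').reset_index()
wide.columns.name = None
print(wide)
```

Note that the recovered weight column is float rather than int: melt pooled it into a shared "value" column with the float heights.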
Combining with stats and GroupBy
It should be no shock that combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations. Here is some data:
>>> tuples = [
... ('bar', 'one'),
... ('bar', 'two'),
... ('baz', 'one'),
... ('baz', 'two'),
... ('foo', 'one'),
... ('foo', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
>>> tuples = [
... ('A', 'a'),
... ('A', 'b'),
... ('B', 'a'),
... ('B', 'b')]
>>> columns = pd.MultiIndex.from_tuples(tuples, names=["big", "small"])
>>> df = pd.DataFrame(data=np.random.randn(6, 4), index=index, columns=columns)
>>> df
big A B
small a b a b
first second
bar one 0.234754 1.338871 -1.856121 -0.762247
two -1.600562 -0.978893 0.384906 1.001843
baz one 0.323573 0.846493 0.996401 -0.178696
two 0.478565 0.469034 0.039025 -0.759669
foo one 0.144314 -0.467715 -1.071996 -0.839940
two -0.987091 0.723771 -0.358198 -0.918023
You can combine the methods we have used so far to get the average values for each row for all columns on the second level as follows:
>>> df.stack().mean(axis=1).unstack()
small a b
first second
bar one -0.810684 0.288312
two -0.607828 0.011475
baz one 0.659987 0.333899
two 0.258795 -0.145318
foo one -0.463841 -0.653827
two -0.672644 -0.097126
>>> # similarly
>>> df.groupby(level=1, axis=1).mean()
small a b
first second
bar one -0.810684 0.288312
two -0.607828 0.011475
baz one 0.659987 0.333899
two 0.258795 -0.145318
foo one -0.463841 -0.653827
two -0.672644 -0.097126
Please pay attention to how the use of different axes influences the data manipulation here! We can get the averages for the first column level as well:
>>> df.groupby(level=0, axis=1).mean()
big A B
first second
bar one 0.786812 -1.309184
two -1.289728 0.693374
baz one 0.585033 0.408853
two 0.473799 -0.360322
foo one -0.161700 -0.955968
two -0.131660 -0.638111
Or we can group by the second index level and take the average. Then, if we want to index by the bigger letter group, we can simply use stack() again:
>>> df.groupby(level=1, axis=0).mean()
big A B
small a b a b
second
one 0.234214 0.572550 -0.643905 -0.593628
two -0.703029 0.071304 0.021911 -0.225283
>>> df.groupby(level=1, axis=0).mean().stack(level=0)
small a b
second big
one A 0.234214 0.572550
B -0.643905 -0.593628
two A -0.703029 0.071304
B 0.021911 -0.225283
You can dance around and aggregate and stack and unstack dataframes like this until you find the representation that you desire!
More GroupBy operations
If you want to experiment more with using complex GroupBy operations combined with stacking and pivoting, do check out this Pandas docs page that goes quite in depth on how to use GroupBy to its full potential. Some examples:
You can filter out groups from your DataFrame based on some criterion. Elements from groups are filtered out if they do not satisfy the boolean criterion specified by the function you pass to filter. For example, here we remove all elements from groups whose mean value for column B does not exceed 3:
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
... 'foo', 'bar'],
... 'B' : [1, 2, 3, 4, 5, 6],
... 'C' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0
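Related to filter is transform, which returns a result aligned to the original rows; this makes it handy for broadcasting a group statistic back onto every element. A sketch on the same small frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0]})

# Each row receives the mean of column B within its own group.
df['B_group_mean'] = df.groupby('A')['B'].transform('mean')
print(df)
```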
Additionally, while we are used to aggregating numerical columns using mean() and sum(), you can create custom functions to process the data in each group using DataFrame.groupby.apply(). You can pass a function to apply that returns a Series, DataFrame or a singular value, and it will give you something back accordingly:
>>> df
A B C
0 foo 1 2.0
1 bar 2 5.0
2 foo 3 8.0
3 bar 4 1.0
4 foo 5 2.0
5 bar 6 9.0
>>> df.groupby("A").apply(lambda x: x.mean(numeric_only=True) + x.sum(numeric_only=True))
B C
A
bar 16.0 20.0
foo 12.0 16.0
>>> df.groupby("A", group_keys=True).apply(lambda x: x / x.mean())
A B C
A
bar 1 NaN 0.500000 1.0
3 NaN 1.000000 0.2
5 NaN 1.500000 1.8
foo 0 NaN 0.333333 0.5
2 NaN 1.000000 2.0
4 NaN 1.666667 0.5
You can see here that because the second apply returns a DataFrame (every row in the group is divided by the mean of its group), you get multiple rows per group per column back. However, if we return a single value per column per group, as in the first example, that is also what we get back.
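Besides apply, agg supports named aggregation, where you pick a reduction per column and name the output column yourself (b_mean and c_max below are names chosen for this example):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0]})

# One reduction per column, with explicit output column names.
res = df.groupby('A').agg(b_mean=('B', 'mean'), c_max=('C', 'max'))
print(res)
```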
Pivot tables
While pivot() provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also provides pivot_table() for pivoting with aggregation of numeric data.
The function pivot_table() can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.
It takes a number of arguments:
data: a DataFrame object.
values: a column or a list of columns to aggregate.
index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
aggfunc: function to use for aggregation, defaulting to numpy.mean.
Consider a data set like this:
>>> import datetime
>>> df = pd.DataFrame(
... {
... "A": ["one", "one", "two", "three"] * 6,
... "B": ["A", "B", "C"] * 8,
... "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
... "D": np.random.randn(24),
... "E": np.random.randn(24),
... "F": [datetime.datetime(2021, i, 1) for i in range(1, 13)]
... + [datetime.datetime(2021, i, 15) for i in range(1, 13)],
... }
... )
>>> df
A B C D E F
0 one A foo 0.876893 1.725516 2021-01-01
1 one B foo -0.145033 -0.653728 2021-02-01
2 two C foo -1.223348 0.106080 2021-03-01
3 three A bar 0.893999 -0.680637 2021-04-01
4 one B bar -0.917856 0.315832 2021-05-01
5 one C bar 1.412261 -0.297173 2021-06-01
6 two A foo 0.123473 -1.574639 2021-07-01
7 three B foo -0.591744 -1.529947 2021-08-01
8 one C foo -0.640334 0.642752 2021-09-01
9 one A bar 0.607874 -0.393234 2021-10-01
10 two B bar -0.408787 0.648159 2021-11-01
11 three C bar 0.557279 -0.584821 2021-12-01
12 one A foo -2.253013 -0.557422 2021-01-15
13 one B foo 1.411905 -0.846590 2021-02-15
14 two C foo 0.571643 0.570896 2021-03-15
15 three A bar -1.887198 0.164673 2021-04-15
16 one B bar 0.521254 0.420366 2021-05-15
17 one C bar 0.793471 -0.262605 2021-06-15
18 two A foo 1.800878 -0.551422 2021-07-15
19 three B foo 0.738391 0.798382 2021-08-15
20 one C foo -0.874090 -1.731075 2021-09-15
21 one A bar -0.321068 0.190298 2021-10-15
22 two B bar 0.156898 -0.779324 2021-11-15
23 three C bar 0.951051 0.654550 2021-12-15
We can produce pivot tables from this data very easily:
>>> pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
C bar foo
A B
one A 0.143403 -0.688060
B -0.198301 0.633436
C 1.102866 -0.757212
three A -0.496599 NaN
B NaN 0.073323
C 0.754165 NaN
two A NaN 0.962176
B -0.125944 NaN
C NaN -0.325852
By default, the aggfunc used is np.mean, so if you want to aggregate the values via the mean you can leave it as is.

Alternatively, you could aggregate values by summing:
>>> pd.pivot_table(df, values="D", index=["B"], columns=["A", "C"], aggfunc=np.sum)
A one three two
C bar foo bar foo bar foo
B
A 0.286806 -1.376121 -0.993199 NaN NaN 1.924351
B -0.396603 1.266872 NaN 0.146647 -0.251889 NaN
C 2.205732 -1.514424 1.508330 NaN NaN -0.651704

The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of hierarchy in the columns, similarly to what we have seen for df.pivot():
>>> pd.pivot_table(df, index=["A", "B"], columns=["C"])
D E
C bar foo bar foo
A B
one A 0.143403 -0.688060 -0.101468 0.584047
B -0.198301 0.633436 0.368099 -0.750159
C 1.102866 -0.757212 -0.279889 -0.544161
three A -0.496599 NaN -0.257982 NaN
B NaN 0.073323 NaN -0.365783
C 0.754165 NaN 0.034864 NaN
two A NaN 0.962176 NaN -1.063030
B -0.125944 NaN -0.065583 NaN
C NaN -0.325852 NaN 0.338488
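Two more pivot_table parameters worth knowing: fill_value replaces the NaN of empty cells, and margins=True appends an "All" row and column with the aggregate over everything. A small sketch with made-up numbers where one combination is deliberately missing:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['one', 'one', 'two'],
    'C': ['bar', 'foo', 'bar'],
    'D': [1.0, 2.0, 3.0],
})

# ('two', 'foo') never occurs, so its cell would be NaN without fill_value.
table = pd.pivot_table(df, values='D', index='A', columns='C',
                       aggfunc='sum', fill_value=0, margins=True)
print(table)
```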