Week 19

Advanced Pandas DataFrames

Advanced Pandas DataFrames

In this module we will learn more about Pandas DataFrames. Specifically, we will learn how to pivot around and reshaping our DataFrames so we can better prepare them for the task we want them to perform. Before we dive into the pivoting, we need to now a little bit about MultiIndexes.

MultiIndexing

A MultiIndex, or hierarchical index, is a way to handle multiple levels of indexing in a pandas DataFrame. It allows you to have multiple levels of indexing on an axis. Here is an example of creating a DataFrame with a MultiIndex:

>>> # Create a list of tuples representing the data
>>> data = [
...         ('A', 'a', 1, 'dog'),
...         ('A', 'b', 2, 'dog'),
...         ('B', 'a', 3, 'cat'),
...         ('B', 'b', 4, 'cat')]

>>> # Create a MultiIndex by passing the tuples to the DataFrame constructor
>>> df = pd.DataFrame(data, columns=['level_1', 'level_2', 'value', 'category'])

>>> # set the multiIndex
>>> df = df.set_index(['level_1', 'level_2'])
>>> df
                 value category
level_1 level_2                
A       a            1      dog
        b            2      dog
B       a            3      cat
        b            4      cat

>>> df.index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['level_1', 'level_2'])

As you can see, our Index now has two "levels", as it is called. It is represented as a list of tuples, but a MultiIndex is hierarchical in the sense that the first level is used as the main identifier, then the second level and so on. We can index MultiIndexes as follows:

>>> df.loc['A']
         value category
level_2                
a            1      dog
b            2      dog

>>> df.loc[('A', 'a')]
value         1
category    dog
Name: (A, a), dtype: object

This is also where the hierarchical nature of a MultiIndex comes in, we cannot simply access the index on the second level using df.loc["a"], it will resort to the first level by default. To access specific indices on different levels we can use xs():

>>> df.xs('a', level=1)
         value category
level_1                
A            1      dog
B            3      cat

Where you need to specify the level that you are indexing. Levels, like everything in Python, are indexed based on 0 so that's why we specify level=1.

Multi-level indexing allows you to perform operations on subsets of the data using the .groupby() method, and it can be used to perform more complex data reshaping and aggregations. You can call groupby() using an index level as follows:

>>> df.groupby(level=1).mean(numeric_only=True)
         value
level_2       
a          2.0
b          3.0

You can apply the exact same hierarchical MultiIndexing to the column axis as well, not just to the rows, this allows for similar complex data reshaping but not to use groupby() in the same way:

>>> data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>> cols = cols = pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'c')])
>>> df = pd.DataFrame(data, columns=cols)
   A    B
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

>>> df.loc[:, ('A')]
   a  b
0  1  2
1  4  5
2  7  8

>>> df.xs('a', level=1, axis=1)
   A
   a
0  1
1  4
2  7

You can check out this Pandas docs page on MulitIndexes and advance indexing for more elaborate explanations.

Reshaping and pivot tables

Reshaping by pivoting DataFrame objects

You may be familiar with pivot tables in Excel to generate easy insights into your data. In this post, we’ll explore how to create Python pivot tables using the pivot table function available in Pandas. The function itself is quite easy to use, but it’s not the most intuitive. This post will give you a complete overview of how to use the function!

The function is quite similar to the group by function also available in Pandas, but offers significantly more customization, as we’ll see later on in this post.

Data is often stored in so-called “stacked” or “record” format:

>>> df
         date variable     value
0  2000-01-03        A  0.441639
1  2000-01-04        A -0.829340
2  2000-01-05        A  0.534826
3  2000-01-03        B  0.950670
4  2000-01-04        B -0.736797
5  2000-01-05        B -0.712704
6  2000-01-03        C  1.829353
7  2000-01-04        C -1.893993
8  2000-01-05        C  1.198726
9  2000-01-03        D -0.469722
10 2000-01-04        D  0.793962
11 2000-01-05        D -1.632517

For the curious, here is how that DataFrame is created, but you don't have to worry too much about it:

import numpy as np
import pandas as pd
import pandas._testing as tm

def unpivot(frame):
    N, K = frame.shape
    data = {
        "value": frame.to_numpy().ravel("F"),
        "variable": np.asarray(frame.columns).repeat(N),
        "date": np.tile(np.asarray(frame.index), K),
    }
    return pd.DataFrame(data, columns=["date", "variable", "value"])

df = unpivot(tm.makeTimeDataFrame(3))

We have learned in the previous module how to access data of a specific category:

>>> df.loc[df["variable"] == "A"]
        date variable     value
0 2000-01-03        A  0.441639
1 2000-01-04        A -0.829340
2 2000-01-05        A  0.534826

However, suppose we wish to do some analysis on how all of the variables (A, B, C) differ over time. A better representation would be a DataFrame where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, we use the DataFrame.pivot() method (also implemented as a top level function pivot()):

>>> df.pivot(index="date", columns="variable", values="value")
variable           A         B         C         D
date                                              
2000-01-03  0.441639  0.950670  1.829353 -0.469722
2000-01-04 -0.829340 -0.736797 -1.893993  0.793962
2000-01-05  0.534826 -0.712704  1.198726 -1.632517

Now we can index by the date, and easily compute statistics of variables using what we have learned last week (df["A"].mean(), for example).

If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column:

>>> df["value2"] = df["value"] * 2
>>> pivoted = df.pivot(index="date", columns="variable")
>>> pivoted
               value                                  value2                              
variable           A         B         C         D         A         B         C         D
date                                                                                      
2000-01-03  0.441639  0.950670  1.829353 -0.469722  2.441639  2.950670  3.829353  1.530278
2000-01-04 -0.829340 -0.736797 -1.893993  0.793962  1.170660  1.263203  0.106007  2.793962
2000-01-05  0.534826 -0.712704  1.198726 -1.632517  2.534826  1.287296  3.198726  0.367483

You can then select subsets from the pivoted DataFrame:

>>> pivoted["value2"]
variable           A         B         C         D
date                                              
2000-01-03  2.441639  2.950670  3.829353  1.530278
2000-01-04  1.170660  1.263203  0.106007  2.793962
2000-01-05  2.534826  1.287296  3.198726  0.367483

Reshaping by stacking and unstacking

Closely related to the pivot() method are the related stack() and unstack() methods available on Series and DataFrames. These methods are designed to work together with MultiIndex objects. Here are essentially what these methods do:

  • stack: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.

  • unstack: (inverse operation of stack) “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.

The clearest way to explain is by example. Let’s take an example data set that looks as follows and has an hierarchical index:

>>> tuples = [
...           ('bar', 'one'),
...           ('bar', 'two'),
...           ('baz', 'one'),
...           ('baz', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
>>> df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=["A", "B"])
>>> df
                     A         B
first second                    
bar   one    -0.159795  0.158122
      two     0.868592  0.135956
baz   one    -0.098369  0.386648
      two     2.332208 -0.274276

The stack function “compresses” a level in the DataFrame’s columns to produce either:

  • A Series, in the case of a simple column Index.

  • A DataFrame, in the case of a MultiIndex in the columns.

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:

>>> stacked = df.stack()
>>> stacked

first  second   
bar    one     A   -0.159795
               B    0.158122
       two     A    0.868592
               B    0.135956
baz    one     A   -0.098369
               B    0.386648
       two     A    2.332208
               B   -0.274276
dtype: float64

Now you have moved on of the columns to be part of the MultiIndex! With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

>>> stacked.unstack()

                     A         B
first second                    
bar   one    -0.159795  0.158122
      two     0.868592  0.135956
baz   one    -0.098369  0.386648
      two     2.332208 -0.274276

You can unstack different levels. If the indexes have names, you can use the level names instead of specifying the level numbers:

>>> stacked.unstack(1)
second        one       two
first                      
bar   A -0.159795  0.868592
      B  0.158122  0.135956
baz   A -0.098369  2.332208
      B  0.386648 -0.274276    

>>> stacked.unstack("first")
first          bar       baz
second                      
one    A -0.159795 -0.098369
       B  0.158122  0.386648
two    A  0.868592  2.332208
       B  0.135956 -0.274276

Missing data

These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sort_index, of course).

Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type, NaN for float, NaT for datetimelike, etc. For integer types, by default data will converted to float and missing values will be set to NaN.

In general, it is good to check when stacking and unstacking how the data within your newly created index behaves to make sure that you are getting the result you are looking for.

Reshaping by melt

The top-level melt() function and the corresponding DataFrame.melt() are useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”. The names of those columns can be customized by supplying the var_name and value_name parameters. Consider the following data:

>>> data = {
...            "first": ["John", "Mary"],
...            "last": ["Doe", "Bo"],
...            "height": [5.5, 6.0],
...            "weight": [130, 150],
...        }
>>> df = pd.DataFrame(data)
>>> df
  first last  height  weight
0  John  Doe     5.5     130
1  Mary   Bo     6.0     150

We can then do the following to let the height and weight columns function as variables and their values as values:

>>> df.melt(id_vars=["first", "last"])
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

We can rename the new variable column as we please:

>>> cheese.melt(id_vars=["first", "last"], var_name="quantity")
  first last quantity  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

When transforming a DataFrame using melt(), the index will be ignored. The original index values can be kept around by setting the ignore_index parameter to False (default is True). This will however duplicate them.

>>> index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
>>> df = pd.DataFrame(data=data, index=index)
>>> df
         first last  height  weight
person A  John  Doe     5.5     130
       B  Mary   Bo     6.0     150

If we perform the same operation as before the index will be ignored:

>>> df.melt(id_vars=["first", "last"])
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

Unless told otherwise:

>>> df.melt(id_vars=["first", "last"], ignore_index=False)
         first last variable  value
person A  John  Doe   height    5.5
       B  Mary   Bo   height    6.0
       A  John  Doe   weight  130.0
       B  Mary   Bo   weight  150.0

Combining with stats and GroupBy

It should be no shock that combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations. Here is some data:

>>> tuples = [
...           ('bar', 'one'),
...           ('bar', 'two'),
...           ('baz', 'one'),
...           ('baz', 'two'),
...           ('foo', 'one'),
...           ('foo', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
>>> tuples = [
...            ('A', 'a'),
...            ('A', 'b'),
...            ('B', 'a'),
...            ('B', 'b')]
>>> columns = pd.MultiIndex.from_tuples(tuples, names=["big", "small"])
>>> df = pd.DataFrame(data=np.random.randn(6, 4), index=index, columns=columns)
>>> df
big                  A                   B          
small                a         b         a         b
first second                                        
bar   one     0.234754  1.338871 -1.856121 -0.762247
      two    -1.600562 -0.978893  0.384906  1.001843
baz   one     0.323573  0.846493  0.996401 -0.178696
      two     0.478565  0.469034  0.039025 -0.759669
foo   one     0.144314 -0.467715 -1.071996 -0.839940
      two    -0.987091  0.723771 -0.358198 -0.918023

You can combine the methods we have used so far to get the average values for each row for all columns on the second level as follows:

>>> df.stack().mean(axis=1).unstack()
small                a         b
first second                    
bar   one    -0.810684  0.288312
      two    -0.607828  0.011475
baz   one     0.659987  0.333899
      two     0.258795 -0.145318
foo   one    -0.463841 -0.653827
      two    -0.672644 -0.097126

>>> # similarly
>>> df.groupby(level=1, axis=1).mean()
small                a         b
first second                    
bar   one    -0.810684  0.288312
      two    -0.607828  0.011475
baz   one     0.659987  0.333899
      two     0.258795 -0.145318
foo   one    -0.463841 -0.653827
      two    -0.672644 -0.097126

Please pay attention to how the use of different axes influences the data manipulation here! We can get the averages for the first column level as well:

big                  A         B
first second                    
bar   one     0.786812 -1.309184
      two    -1.289728  0.693374
baz   one     0.585033  0.408853
      two     0.473799 -0.360322
foo   one    -0.161700 -0.955968
      two    -0.131660 -0.638111

Or we can group by the second index level, and take the average. Then if we want to index by the bigger letter group, we can simply use stack() again:

>>> df.groupby(level=1, axis=0).mean()
big            A                   B          
small          a         b         a         b
second                                        
one     0.234214  0.572550 -0.643905 -0.593628
two    -0.703029  0.071304  0.021911 -0.225283

>>> df.groupby(level=1, axis=0).mean().stack(level=0)
small              a         b
second big                    
one    A    0.234214  0.572550
       B   -0.643905 -0.593628
two    A   -0.703029  0.071304
       B    0.021911 -0.225283

You can dance around and aggregate and stack and unstack dataframes like this until you find the representation that you desire!

More GroupBy operations

If you want to experiment more with using complex GroupBy operations combined with stacking and pivoting, do check out this Pandas docs page that goes quite in depth on how to use GroupBy to its full potential. Some examples:

You can filter out groups from your DataFrame based on some criterion that you wish. Elements from groups are filtered if they do not satisfy the boolean criterion specified by the function you pass to filter. For example, here we can remove all elements from all groups if the corresponding mean value for some column does not exceed 3:

>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...                           'foo', 'bar'],
...                    'B' : [1, 2, 3, 4, 5, 6],
...                    'C' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0

Additionally, we are used to aggregating numerical columns using mean() and sum() you can create custom functions to process the data in each group using DataFrame.groupby.apply(). You can pass a function to apply that returns a Series, DataFrame or a singular value and it will give you something back accordingly:

>>> df
     A  B    C
0  foo  1  2.0
1  bar  2  5.0
2  foo  3  8.0
3  bar  4  1.0
4  foo  5  2.0
5  bar  6  9.0

>>> df.groupby("A").apply(lambda x: x.mean + x.sum)
        B     C
A              
bar  16.0  20.0
foo  12.0  16.0

>>> df.groupby("A", group_keys=True).apply(lambda x: x / x.mean())
         A         B    C
A                        
bar 1  NaN  0.500000  1.0
    3  NaN  1.000000  0.2
    5  NaN  1.500000  1.8
foo 0  NaN  0.333333  0.5
    2  NaN  1.000000  2.0df
    4  NaN  1.666667  0.5

You can see here that because the second apply method returns a series (every row in the group is divided by the mean of the group), you find it returns multiple rows per group per column. However, if we return a single value per group as in the first example, this is also what we get back.

Pivot tables

While pivot() provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also provides pivot_table() for pivoting with aggregation of numeric data.

The function pivot_table() can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.

It takes a number of arguments:

  • data: a DataFrame object.

  • values: a column or a list of columns to aggregate.

  • index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.

  • columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.

  • aggfunc: function to use for aggregation, defaulting to numpy.mean.

Consider a data set like this:

>>> df = pd.DataFrame(
...        {
...            "A": ["one", "one", "two", "three"] * 6,
...            "B": ["A", "B", "C"] * 8,
...            "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
...            "D": np.random.randn(24),
...            "E": np.random.randn(24),
...            "F": [datetime.datetime(2021, i, 1) for i in range(1, 13)]
...            + [datetime.datetime(2021, i, 15) for i in range(1, 13)],
...        }
...    )
>>> df
        A  B    C         D         E          F
0     one  A  foo  0.876893  1.725516 2021-01-01
1     one  B  foo -0.145033 -0.653728 2021-02-01
2     two  C  foo -1.223348  0.106080 2021-03-01
3   three  A  bar  0.893999 -0.680637 2021-04-01
4     one  B  bar -0.917856  0.315832 2021-05-01
5     one  C  bar  1.412261 -0.297173 2021-06-01
6     two  A  foo  0.123473 -1.574639 2021-07-01
7   three  B  foo -0.591744 -1.529947 2021-08-01
8     one  C  foo -0.640334  0.642752 2021-09-01
9     one  A  bar  0.607874 -0.393234 2021-10-01
10    two  B  bar -0.408787  0.648159 2021-11-01
11  three  C  bar  0.557279 -0.584821 2021-12-01
12    one  A  foo -2.253013 -0.557422 2021-01-15
13    one  B  foo  1.411905 -0.846590 2021-02-15
14    two  C  foo  0.571643  0.570896 2021-03-15
15  three  A  bar -1.887198  0.164673 2021-04-15
16    one  B  bar  0.521254  0.420366 2021-05-15
17    one  C  bar  0.793471 -0.262605 2021-06-15
18    two  A  foo  1.800878 -0.551422 2021-07-15
19  three  B  foo  0.738391  0.798382 2021-08-15
20    one  C  foo -0.874090 -1.731075 2021-09-15
21    one  A  bar -0.321068  0.190298 2021-10-15
22    two  B  bar  0.156898 -0.779324 2021-11-15
23  three  C  bar  0.951051  0.654550 2021-12-15

We can produce pivot tables from this data very easily:

>>> pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
C             bar       foo
A     B                    
one   A  0.143403 -0.688060
      B -0.198301  0.633436
      C  1.102866 -0.757212
three A -0.496599       NaN
      B       NaN  0.073323
      C  0.754165       NaN
two   A       NaN  0.962176
      B -0.125944       NaN
      C       NaN -0.325852

By default, the aggfunc used is np.mean, so if you want to aggregate the values via the mean you can leave it as is.

Alternatively, you could aggregate values by summing:

>>> pd.pivot_table(df, values="D", index=["B"], columns=["A", "C"], aggfunc=np.sum)
A       one               three                 two          
C       bar       foo       bar       foo       bar       foo
B                                                            
A  0.286806 -1.376121 -0.993199       NaN       NaN  1.924351
B -0.396603  1.266872       NaN  0.146647 -0.251889       NaN
C  2.205732 -1.514424  1.508330       NaN       NaN -0.651704

The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of hierarchy in the columns, similarly to what we have seen for df.pivot():

>>> pd.pivot_table(df, index=["A", "B"], columns=["C"])
                D                   E          
C             bar       foo       bar       foo
A     B                                        
one   A  0.143403 -0.688060 -0.101468  0.584047
      B -0.198301  0.633436  0.368099 -0.750159
      C  1.102866 -0.757212 -0.279889 -0.544161
three A -0.496599       NaN -0.257982       NaN
      B       NaN  0.073323       NaN -0.365783
      C  0.754165       NaN  0.034864       NaN
two   A       NaN  0.962176       NaN -1.063030
      B -0.125944       NaN -0.065583       NaN
      C       NaN -0.325852       NaN  0.338488

Last updated