Week 19
Advanced Pandas DataFrames
Advanced Pandas DataFrames
In this module we will learn more about Pandas DataFrames. Specifically, we will learn how to pivot around and reshaping our DataFrames so we can better prepare them for the task we want them to perform. Before we dive into the pivoting, we need to now a little bit about MultiIndexes.
MultiIndexing
A MultiIndex, or hierarchical index, is a way to handle multiple levels of indexing in a pandas DataFrame. It allows you to have multiple levels of indexing on an axis. Here is an example of creating a DataFrame with a MultiIndex:
As you can see, our Index now has two "levels", as it is called. It is represented as a list of tuples, but a MultiIndex is hierarchical in the sense that the first level is used as the main identifier, then the second level and so on. We can index MultiIndexes as follows:
This is also where the hierarchical nature of a MultiIndex comes in, we cannot simply access the index on the second level using df.loc["a"]
, it will resort to the first level by default. To access specific indices on different levels we can use xs()
:
Where you need to specify the level that you are indexing. Levels, like everything in Python, are indexed based on 0 so that's why we specify level=1.
Multi-level indexing allows you to perform operations on subsets of the data using the .groupby()
method, and it can be used to perform more complex data reshaping and aggregations. You can call groupby()
using an index level as follows:
You can apply the exact same hierarchical MultiIndexing to the column axis as well, not just to the rows, this allows for similar complex data reshaping but not to use groupby()
in the same way:
You can check out this Pandas docs page on MulitIndexes and advance indexing for more elaborate explanations.
Reshaping and pivot tables
Reshaping by pivoting DataFrame objects
You may be familiar with pivot tables in Excel to generate easy insights into your data. In this post, we’ll explore how to create Python pivot tables using the pivot table function available in Pandas. The function itself is quite easy to use, but it’s not the most intuitive. This post will give you a complete overview of how to use the function!
The function is quite similar to the group by function also available in Pandas, but offers significantly more customization, as we’ll see later on in this post.
Data is often stored in so-called “stacked” or “record” format:
For the curious, here is how that DataFrame is created, but you don't have to worry too much about it:
We have learned in the previous module how to access data of a specific category:
However, suppose we wish to do some analysis on how all of the variables (A, B, C) differ over time. A better representation would be a DataFrame where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, we use the DataFrame.pivot()
method (also implemented as a top level function pivot()
):
Now we can index by the date, and easily compute statistics of variables using what we have learned last week (df["A"].mean()
, for example).
If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column:
You can then select subsets from the pivoted DataFrame:
Reshaping by stacking and unstacking
Closely related to the pivot()
method are the related stack()
and unstack()
methods available on Series and DataFrames. These methods are designed to work together with MultiIndex objects. Here are essentially what these methods do:
stack
: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.unstack
: (inverse operation of stack) “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
The clearest way to explain is by example. Let’s take an example data set that looks as follows and has an hierarchical index:
The stack function “compresses” a level in the DataFrame’s columns to produce either:
A Series, in the case of a simple column Index.
A DataFrame, in the case of a MultiIndex in the columns.
If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:
Now you have moved on of the columns to be part of the MultiIndex! With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:
You can unstack different levels. If the indexes have names, you can use the level names instead of specifying the level numbers:
Missing data
These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sort_index, of course).
Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type, NaN for float, NaT for datetimelike, etc. For integer types, by default data will converted to float and missing values will be set to NaN.
In general, it is good to check when stacking and unstacking how the data within your newly created index behaves to make sure that you are getting the result you are looking for.
Reshaping by melt
The top-level melt()
function and the corresponding DataFrame.melt()
are useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”. The names of those columns can be customized by supplying the var_name and value_name parameters. Consider the following data:
We can then do the following to let the height and weight columns function as variables and their values as values:
We can rename the new variable column as we please:
When transforming a DataFrame using melt(), the index will be ignored. The original index values can be kept around by setting the ignore_index parameter to False (default is True). This will however duplicate them.
If we perform the same operation as before the index will be ignored:
Unless told otherwise:
Combining with stats and GroupBy
It should be no shock that combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations. Here is some data:
You can combine the methods we have used so far to get the average values for each row for all columns on the second level as follows:
Please pay attention to how the use of different axes influences the data manipulation here! We can get the averages for the first column level as well:
Or we can group by the second index level, and take the average. Then if we want to index by the bigger letter group, we can simply use stack()
again:
You can dance around and aggregate and stack and unstack dataframes like this until you find the representation that you desire!
More GroupBy operations
If you want to experiment more with using complex GroupBy operations combined with stacking and pivoting, do check out this Pandas docs page that goes quite in depth on how to use GroupBy to its full potential. Some examples:
You can filter out groups from your DataFrame based on some criterion that you wish. Elements from groups are filtered if they do not satisfy the boolean criterion specified by the function you pass to filter
. For example, here we can remove all elements from all groups if the corresponding mean value for some column does not exceed 3:
Additionally, we are used to aggregating numerical columns using mean()
and sum()
you can create custom functions to process the data in each group using DataFrame.groupby.apply()
. You can pass a function to apply
that returns a Series, DataFrame or a singular value and it will give you something back accordingly:
You can see here that because the second apply method returns a series (every row in the group is divided by the mean of the group), you find it returns multiple rows per group per column. However, if we return a single value per group as in the first example, this is also what we get back.
Pivot tables
While pivot() provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also provides pivot_table() for pivoting with aggregation of numeric data.
The function pivot_table() can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.
It takes a number of arguments:
data: a DataFrame object.
values: a column or a list of columns to aggregate.
index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
aggfunc: function to use for aggregation, defaulting to numpy.mean.
Consider a data set like this:
We can produce pivot tables from this data very easily:
By default, the aggfunc
used is np.mean
, so if you want to aggregate the values via the mean you can leave it as is.
Alternatively, you could aggregate values by summing:
The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of hierarchy in the columns, similarly to what we have seen for df.pivot():
Last updated