Week 18
Pandas Dataframes
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, the data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns. You might recognize this structure from the matrices in NumPy, and indeed, just as in NumPy, you can access rows and columns separately and apply all sorts of functions and modifications across both axes. The nice thing about Pandas DataFrames is that they have labels for both the columns and the rows.
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc. Here are some of the ways to create a DataFrame:
This section covers: read_csv(), read_excel()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
read_csv() reads the data from a .csv file.
Opening a CSV file this way is easy, but there are many other things one can do through this function to change the returned object completely. For instance, one can read a CSV file not only locally but also from a URL through read_csv, and one can choose which columns are needed, so that we don't have to edit the DataFrame later.
Here is how to load a file in the most basic way:
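A minimal sketch, assuming a hypothetical file called data.csv exists in the working directory:

```python
import pandas as pd

# Load a CSV file into a DataFrame (the file name is hypothetical)
df = pd.read_csv("data.csv")
print(df.head())
```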
You have to specify the path of the file if it is not in the same folder.
File loading parameters
Some files that you receive and want to open in Pandas might not be picture perfect. They might or might not include an index column. They may start on the fifth row, or halfway through there may be some rows you want to exclude. The following parameters can help you load the file the way you want. You can see the parameters of any function by pressing Shift + Tab in a Jupyter notebook.
◦ filepath_or_buffer: URL or directory location of the file
◦ sep: stands for separator; the default is ',' as in CSV (comma-separated values)
◦ index_col: makes the passed column the index instead of 0, 1, 2, 3, …
◦ header: makes the passed row(s) (int or list of ints) the header
◦ usecols: uses only the passed columns (list of strings) to make the DataFrame
◦ squeeze: if True and only one column is passed, returns a Pandas Series (deprecated in recent Pandas versions)
◦ skiprows: skips the passed rows in the new DataFrame
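For example, a sketch combining a few of these parameters (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv(
    "data.csv",
    sep=",",                  # field separator (this is the default)
    skiprows=[1, 2],          # skip rows 1 and 2 of the file
    usecols=["name", "age"],  # keep only these columns
    index_col="name",         # use the 'name' column as the index
)
```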
read_excel() can be used to read data from an Excel file:
Similarly to read_csv, the index and header can be specified via the index_col and header arguments.
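A sketch, assuming a hypothetical data.xlsx file (reading .xlsx files requires an engine such as openpyxl to be installed):

```python
import pandas as pd

# First row as header, first column as index (file name is hypothetical)
df = pd.read_excel("data.xlsx", header=0, index_col=0)
```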
Let's say we have an Excel file with two sheets, Employees and Cars. The top row contains the header of the table.
Here is an example that reads the "Employees" sheet data and prints it.
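A sketch, assuming a hypothetical workbook employees.xlsx with those two sheets:

```python
import pandas as pd

# sheet_name selects which sheet of the workbook to load
df = pd.read_excel("employees.xlsx", sheet_name="Employees")
print(df)
```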
The first parameter is the name of the Excel file. The sheet_name parameter defines the sheet to be read from the Excel file. When we print the DataFrame object, the output is a two-dimensional table, which looks similar to the records in an Excel sheet.
You can now load some data into a DataFrame, great! One of the first things to do when receiving a CSV file, Excel file, or SQL table with data is to look at the properties of the table to get a grip on its contents, size, etc. This section covers some methods to do that.
This section covers: DataFrame.head(), DataFrame.shape, DataFrame.dtypes, DataFrame.columns, DataFrame.index
DataFrame.head(n=5)
Returns the first n rows of the object based on position. It is useful for quickly testing whether your object has the right type of data in it.
In a similar fashion you can use DataFrame.tail() to return the last n rows.
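For example, with a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion",
                              "monkey", "parrot", "shark", "whale"]})
print(df.head())   # first 5 rows
print(df.tail(3))  # last 3 rows
```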
You can get the shape the same way as you are used to from NumPy:
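A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cleo"], "age": [30, 25, 41]})
print(df.shape)  # (3, 2): three rows, two columns
```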
As opposed to NumPy matrices, though, DataFrames can have multiple data types for different columns. You can check which data types are present using DataFrame.dtypes. This returns a Series with the data type of each column. The result's index is the original DataFrame's columns. Columns with mixed types are stored with the object dtype. See the User Guide for more.
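For example, with a made-up frame holding three different types:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25], "score": [9.5, 7.0]})
print(df.dtypes)
# name      object
# age        int64
# score    float64
```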
You can gain insight into all the columns in the DataFrame as follows:
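A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})
print(df.columns)  # Index(['name', 'age'], dtype='object')
```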
You can see the returned object is of the Index type. This is the Pandas type for column and index labels. Similarly to DataFrame.columns, you can call DataFrame.index to get the index (row labels).
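Continuing the same toy example:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"]})
print(df.index)  # RangeIndex(start=0, stop=2, step=1)
```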
This section covers: DataFrame.describe(), Series.unique(), DataFrame.nunique(), Series.value_counts()
One of the first things you can do to quickly investigate a DataFrame is to use describe() to view some basic statistical details like percentiles, mean, std, etc. of a data frame or a series of numeric values. By default, describe just takes the numerical columns into account:
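A minimal sketch with a mixed-type toy frame:

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 25, 41], "city": ["Oslo", "Bergen", "Oslo"]})
print(df.describe())  # count, mean, std, min, quartiles, max for 'age' only
```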
Describing all columns of a DataFrame regardless of data type:
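Using the same toy frame as above:

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 25, 41], "city": ["Oslo", "Bergen", "Oslo"]})
print(df.describe(include="all"))  # adds count/unique/top/freq for 'city'
```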
For all such functions it is good practice to look them up in the Pandas documentation for interesting default function arguments like include. They are often not included in these examples but are worth exploring!
When inspecting individual columns or indexes, the unique() function is used to get the unique values of a Series object. Unique values are returned in order of appearance and are not sorted.
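For example, with a small made-up Series:

```python
import pandas as pd

s = pd.Series([2, 1, 3, 3])
print(s.unique())  # [2 1 3], in order of appearance, not sorted
```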
In a similar fashion you can investigate the number of unique values per column as follows; you can use this to gain insight into the data distribution:
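A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": ["x", "x", "x"]})
print(df.nunique())
# A    2
# B    1
```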
You can also get the number of unique values per row; for that, you pass the axis argument:
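Using the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": ["x", "x", "x"]})
print(df.nunique(axis=1))  # number of unique values in each row
```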
You can gain even more insight into the distribution of the data in a column using the value_counts() function. This can also be used for the Index object. The function returns an object containing counts of unique values. The resulting object will be in descending order, so that the first element is the most frequently occurring element. It excludes NaN values by default.
With dropna set to False we can also see NaN index values.
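Using the same Series:

```python
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, None])
print(s.value_counts(dropna=False))  # NaN now gets its own count
```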
Many functions built into Pandas have the option to drop NaNs or not; look up the function documentation to check (simply Google).
Similarly to NumPy you can get all sorts of statistics about columns:
◦ df['column name'].min()
◦ df['column name'].max()
◦ df['column name'].mean()
◦ df['column name'].sum()
◦ df['column name'].count()
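A quick sketch of these on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 25, 41]})
print(df["age"].min(), df["age"].max())   # 25 41
print(df["age"].mean(), df["age"].sum())  # 32.0 96
print(df["age"].count())                  # 3 non-NaN values
```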
This section covers: loc[], iloc[], where(), isin()
Indexing in Pandas works slightly differently than in NumPy, with specific functions to access certain rows and columns. We mainly use the loc[] and iloc[] functionality. You will be using these a lot!
iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. This allows you to slice and index the table like you are used to from NumPy. By default iloc will index the rows, as you can see in the first example:
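A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
print(df.iloc[0])  # the first row, returned as a Series
```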
You can use a list of indices to return multiple rows, or use slices, just as in NumPy:
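Using the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
print(df.iloc[[0, 2]])  # rows 0 and 2, as a DataFrame
print(df.iloc[1:3])     # rows 1 and 2; the end is exclusive, as in NumPy
```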
You can also access specific rows and columns by using comma separated indices:
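Again with the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
print(df.iloc[0, 1])   # single element: row 0, column 1
print(df.iloc[:, :2])  # all rows, first two columns
```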
You can use loc[] to access a group of rows and columns by label(s) or a boolean array.
Allowed inputs are:
◦ A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
◦ A list or array of labels, e.g. ['a', 'b', 'c'].
◦ A slice object with labels, e.g. 'a':'f'. Warning: contrary to usual Python slices, both the start and the stop are included.
◦ A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
Some examples:
List of labels. Note that using [[]] returns a DataFrame.
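A sketch with a small labeled toy frame:

```python
import pandas as pd

df = pd.DataFrame({"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
                  index=["cobra", "viper", "sidewinder"])
print(df.loc[["viper", "sidewinder"]])  # double brackets return a DataFrame
```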
Single label for row and column
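Using the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
                  index=["cobra", "viper", "sidewinder"])
print(df.loc["cobra", "shield"])  # a single value: 2
```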
Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included.
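Again with the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
                  index=["cobra", "viper", "sidewinder"])
print(df.loc["cobra":"viper", "max_speed"])  # both endpoints are included
```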
Similarly to NumPy arrays, you can use boolean indexing, which works great in combination with loc:
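For example, selecting rows that satisfy a condition:

```python
import pandas as pd

df = pd.DataFrame({"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
                  index=["cobra", "viper", "sidewinder"])
print(df.loc[df["shield"] > 4])  # only the rows where the condition is True
```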
The isin() function is used to check whether each element in the DataFrame is contained in the passed values.
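A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])
print(df.isin([0, 2]))  # element-wise True/False
```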
When values is a dict, we can pass values to check for each column separately:
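Using the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])
print(df.isin({"num_wings": [0, 3]}))  # only num_wings is checked
```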
The Pandas where() method is used to check a data frame for one or more conditions and return the result accordingly. By default, the rows not satisfying the condition are filled with NaN values.
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.
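A minimal sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(range(5))
print(s.where(s > 1))      # 0 and 1 become NaN
print(s.where(s > 1, -1))  # ...or are replaced by `other` (-1 here)
```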
Similarly, there exists np.where!
This section covers: insert(), drop(), sort_values(), rename()
You are never stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.
You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
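A minimal sketch with a made-up frame, deriving a new column by a unit change:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "distance_m": [5000, 12000]})
df["distance_km"] = df["distance_m"] / 1000  # metres to kilometres
```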
Another way is by using DataFrame.insert(). It gives you the freedom to add a column at any position you like, not just at the end. It also provides different options for inserting the column values.
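For example, with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [9.5, 7.0]})
df.insert(1, "age", [30, 25])  # place 'age' at position 1 instead of the end
print(df.columns)  # Index(['name', 'age', 'score'], dtype='object')
```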
The drop() function is used to drop specified labels from rows or columns.
Remove rows or columns by specifying label names and the corresponding axis, or by directly specifying index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
By default rows are dropped:
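A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df.drop([0, 1]))         # rows with labels 0 and 1 removed
print(df.drop(columns=["B"]))  # column B removed
```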
rename()
One way of renaming the columns in a Pandas dataframe is by using the rename() function. This method is quite useful when we need to rename only some selected columns, because we need to specify information only for the columns that are to be renamed. You can also use it to rename the index.
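For example, renaming a single column of a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df = df.rename(columns={"A": "alpha"})  # only the renamed column is listed
```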
You can use rename to change the index from the default range from 0 to len(df) - 1 to strings as follows:
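A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})
df = df.rename(index={0: "first", 1: "second"})  # string labels instead of 0, 1
```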
The Pandas sort_values() function sorts a data frame in ascending or descending order of a passed column. It differs from Python's built-in sorted() function, which cannot sort a data frame and does not let you select a particular column.
Sort by multiple columns
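For example, with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Bergen", "Oslo"], "age": [41, 25, 30]})
print(df.sort_values(by=["city", "age"]))  # sort by city, then by age
```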
You can use the ascending argument to select whether you want to sort in ascending or descending fashion. If you do not want to keep the original index labels, you can use the ignore_index argument to relabel the result 0, 1, ..., n - 1.
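Continuing the toy example:

```python
import pandas as pd

df = pd.DataFrame({"age": [41, 25, 30]})
print(df.sort_values(by="age", ascending=False, ignore_index=True))
# descending order, with the index relabelled 0, 1, 2
```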
Many functions that potentially shuffle the index like this have the optional argument ignore_index. If the index is not set to specific strings but is simply equal to the row number, you might find it useful to set it to True!
This section covers: fillna(), isna(), dropna()
It can often be the case that the data you receive to process contains missing values. Often you want to process these in a certain way, because they are not useful for visualization or modelling. This section covers how to find and replace missing values in your DataFrame.
The fillna() function can be used to fill all the NaNs in your dataframe with a specific value:
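A minimal sketch with a toy frame containing missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, 5, 6]})
print(df.fillna(0))  # every NaN replaced by 0
```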
You can use the isna() function to simply check for NaNs in your dataset. You can use this, for example, to count the NaNs per column:
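Using the same toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, 5, 6]})
print(df.isna().sum())  # number of NaNs per column
```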
We have shown you how to find and replace missing values. In some instances, you might simply want to drop all the rows with missing values, in case you think the data is unusable. This is how (a combined example follows the list):
Drop the rows where at least one element is missing.
Drop the columns where at least one element is missing.
Drop the rows where all elements are missing.
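A sketch showing all three variants on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                   "toy": [np.nan, "Batmobile", "Bullwhip"],
                   "age": [np.nan, 25.0, 30.0]})
print(df.dropna())                # keeps only the rows without any NaN
print(df.dropna(axis="columns"))  # keeps only 'name', the column without NaNs
print(df.dropna(how="all"))       # drops nothing here: no row is all-NaN
```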
This section covers:
◦ merge() for combining data on common columns or indices
◦ .join() for combining data on a key column or an index
◦ concat() for combining DataFrames across rows or columns
You can very simply concatenate pandas objects along a particular axis with optional set logic along the other axes. Just like you are used to from NumPy.
Here are two DataFrame objects with identical columns:
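A minimal sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])
print(pd.concat([df1, df2]))  # stacked on top of each other
```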
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.
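Continuing with df1 from above and a third frame with an extra column:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df3 = pd.DataFrame([["c", 3, "cat"], ["d", 4, "dog"]],
                   columns=["letter", "number", "animal"])
print(pd.concat([df1, df3]))  # 'animal' is NaN for the rows from df1
```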
Combine DataFrame objects with overlapping columns and return only the columns that are shared by passing inner to the join keyword argument.
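Using the same frames:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df3 = pd.DataFrame([["c", 3, "cat"], ["d", 4, "dog"]],
                   columns=["letter", "number", "animal"])
print(pd.concat([df1, df3], join="inner"))  # only the shared columns remain
```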
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
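For example:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df4 = pd.DataFrame([["bird", "polly"], ["monkey", "george"]],
                   columns=["animal", "name"])
print(pd.concat([df1, df4], axis=1))  # placed side by side
```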
Merge DataFrame or named Series objects with a database-style join.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise, if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
When you want to combine data objects based on one or more keys, in a similar way to a relational database, merge() is the tool you need. More specifically, merge() is most useful when you want to combine rows that share data.
You can achieve both many-to-one and many-to-many joins with merge(). In a many-to-one join, one of your datasets will have many rows in the merge column that repeat the same values (such as 1, 1, 3, 5, 5), while the merge column in the other dataset will not have repeated values (such as 1, 3, 5).
You can specify what kind of join you want merge to perform: inner, outer, left or right. By default it performs an inner join.
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended:
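A sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"],
                    "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"],
                    "value": [5, 6, 7, 8]})
print(df1.merge(df2, left_on="lkey", right_on="rkey"))
# the overlapping 'value' columns become value_x and value_y
```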
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
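Using the same frames:

```python
import pandas as pd

df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"],
                    "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"],
                    "value": [5, 6, 7, 8]})
print(df1.merge(df2, left_on="lkey", right_on="rkey",
                suffixes=("_left", "_right")))
```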
You can also merge on a single column that both dataframes share:
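A sketch with toy frames, showing both an inner and a left join on the shared column:

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["foo", "bar"], "b": [1, 2]})
df2 = pd.DataFrame({"a": ["foo", "baz"], "c": [3, 4]})
print(df1.merge(df2, on="a", how="inner"))  # only 'foo' matches
print(df1.merge(df2, on="a", how="left"))   # 'bar' is kept, with NaN for c
```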
Notice the difference between the left and inner joins here.
pd.merge is one of those functions for which it is worth checking out the Pandas docs for more optional function arguments!
Join is a more specific case of merge. Even though merge is the underlying process of join and can in principle always be used, in the specific case where you want to do a left join and use the index of the second (right) dataframe to merge, you can use join. In general, it is index-preserving. Note the following code example:
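A sketch with hypothetical frames, joining a key column against the right frame's index:

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
other = pd.DataFrame({"B": ["B0", "B2"]}, index=["K0", "K2"])
# Left join of df's 'key' column against other's index; df's index is preserved
print(df.join(other, on="key"))  # K1 gets NaN for B
```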
Note that these two operations do exactly the same thing:
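Using the same hypothetical frames as above:

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
other = pd.DataFrame({"B": ["B0", "B2"]}, index=["K0", "K2"])
# Both lines produce the same left-joined result:
print(df.join(other, on="key"))
print(df.merge(other, left_on="key", right_index=True, how="left"))
```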
The Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
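A minimal sketch with a toy frame: split by team, apply a sum, and combine:

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"], "points": [5, 7, 3]})
print(df.groupby("team")["points"].sum())
# team A -> 12, team B -> 3
```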
We can also choose whether to include NaN in the group keys by setting the dropna parameter; the default setting is True: