Week 18

Pandas Dataframes

Data Frame in Pandas

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns, and a DataFrame consists of three principal components: the data, the rows, and the columns. You might recognize this structure from the matrices in NumPy, and indeed, similarly to NumPy, you can access rows and columns separately and apply all sorts of functions and modifications across both axes. The nice thing about Pandas DataFrames is that they have labels for both the columns and the rows.

Creating a Pandas DataFrame

In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, dictionaries, a list of dictionaries, etc. DataFrames can be created in different ways; here is one way, from a list:

>>> import pandas as pd
>>> lst = ["I", "am" , "learning", "Pandas"]
>>> pd.DataFrame(lst)

          0
0         I
1        am
2  learning
3    Pandas
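
You can also build a DataFrame from a dictionary, where the keys become the column labels (the column names below are just for illustration):

>>> data = {'name': ['Anna', 'Bob'], 'score': [8, 7]}
>>> pd.DataFrame(data)

   name  score
0  Anna      8
1   Bob      7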

Read data

This section covers: read_csv(), read_excel()

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

read_csv()

Reads data from a .csv file.

Opening a CSV file this way is easy, but there is much more this function can do to change the returned object. For instance, you can read a CSV file not only locally but also from a URL, or you can choose which columns to import so that you don't have to edit the DataFrame afterwards.

Here is how to load a file in the most basic way:

pd.read_csv('data.csv') 

You have to specify the path of the file if it is not in the same folder:

pd.read_csv('C:\\Users\\mrkgl\\Desktop\\PyCoders_new\\data.csv') 

File loading parameters

Some files that you want to open in Pandas might not be picture perfect: they might or might not include an index column, they may start on the fifth row, or there may be some rows halfway through that you want to exclude. Parameters such as index_col, skiprows and usecols can help you load the file the way you want, as shown below. You can see the parameters of any function by pressing Shift + Tab in a Jupyter notebook.
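
A minimal sketch of such parameters (the file name and column names here are hypothetical):

pd.read_csv('data.csv',               # hypothetical file
            skiprows=4,               # the table starts on the fifth row
            usecols=['id', 'price'],  # load only these two columns
            index_col='id')           # use the 'id' column as the index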

read_excel()

Can be used to read data from an Excel file:

>>> pd.read_excel('example.xlsx', index_col=0)  

       Name  Value
0   string1      1
1   string2      2
2  #Comment      3

Similarly to read_csv(), the index and header can be specified via the index_col and header arguments:

>>> pd.read_excel('example.xlsx', index_col=None, header=None)
  
     0         1      2
0  NaN      Name  Value
1  0.0   string1      1
2  1.0   string2      2
3  2.0  #Comment      3
Reading a specific sheet

Let’s say we have an Excel file with two sheets – Employees and Cars. The top row contains the header of the table.

Here is an example that reads the “Employees” sheet data and prints it:

>>> excel_data_df = pd.read_excel('records.xlsx',
...                               sheet_name='Employees')
>>> excel_data_df

  EmpID    EmpName EmpRole
0      1     Pankaj     CEO
1      2  David Lee  Editor
2      3   Lisa Ray  Author
  • The first parameter is the name of the Excel file.

  • The sheet_name parameter defines the sheet to be read from the Excel file.

  • When we print the DataFrame object, the output is a two-dimensional table. It looks similar to the records in an Excel sheet.
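
read_excel() can also load several sheets at once: passing a list of sheet names (or None for all sheets) to sheet_name returns a dictionary of DataFrames keyed by sheet name. A small sketch reusing records.xlsx:

>>> sheets = pd.read_excel('records.xlsx', sheet_name=['Employees', 'Cars'])
>>> sheets['Employees']

  EmpID    EmpName EmpRole
0      1     Pankaj     CEO
1      2  David Lee  Editor
2      3   Lisa Ray  Author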

DataFrame properties

You can now load some data into a DataFrame, great! One of the first things you do when receiving a CSV file, Excel file or SQL table with data is look at the properties of the table to get a grip on its contents, size, etc. This section covers some methods to do that.

Basic properties

This section covers: DataFrame.head(), DataFrame.shape, DataFrame.dtypes, DataFrame.columns, DataFrame.index

DataFrame.head(n=5) returns the first n rows of the object based on position. It is useful for quickly testing whether your object has the right kind of data in it.

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
>>>                   'monkey', 'parrot', 'shark', 'whale', 'zebra']})
                   
>>> df.head()

      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey              

In a similar fashion you can use DataFrame.tail() to return the last n rows.
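
For example, on the animal DataFrame above:

>>> df.tail(3)

   animal
6   shark
7   whale
8   zebra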

You can get the shape the same way as you are used to from NumPy:

>>> df = pd.DataFrame({'col1' : [1, 2], 'col2':  [3, 4],
>>>                    'col3' : [5, 6]})
>>> df.shape

(2, 3)

As opposed to NumPy matrices though, DataFrames can have multiple data types for different columns. You can check which datatypes are present using DataFrame.dtypes. This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype. See the User Guide for more.

>>> df = pd.DataFrame({'float': [1.0],
>>>                    'int': [1],
>>>                    'datetime': [pd.Timestamp('20180310')],
>>>                    'string': ['foo']})
>>> df.dtypes

float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object

You can gain insight into all the columns in the DataFrame as follows:

>>> df = pd.DataFrame({'Weight':[45, 88, 56, 15, 71],
>>>                    'Name':['Sam', 'Andrea', 'Alex', 'Robin', 'Kia'],
>>>                    'Age':[14, 25, 55, 8, 21]})
>>> df.columns

Index(["Weight", "Name", "Age"], dtype="object")

You can see the returned object is of the Index type. This is the Pandas type for column and index labels. Similarly to DataFrame.columns, you can call DataFrame.index to get the index (row labels).
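
For example, for the DataFrame above with its default row labels:

>>> df.index
RangeIndex(start=0, stop=5, step=1)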

More in-depth properties

This section covers: DataFrame.describe(), Series.unique(), DataFrame.nunique(), Series.value_counts()

One of the first things you can do to quickly investigate a DataFrame is use describe() to view some basic statistical details like percentiles, mean, std, etc. of a DataFrame or a Series of numeric values. By default, describe() only takes the numerical columns into account:

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()

       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type:

>>> df.describe(include='all')  

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

For all such functions it is good practice to look them up in the Pandas documentation for interesting default function arguments like include. They are often not included in these examples but are worth exploring!

When inspecting individual columns or indexes, the unique() function is used to get the unique values of a Series object.

Unique values are returned in order of appearance and are not sorted.

>>> data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
>>>         'year': [2012, 2012, 2013, 2014, 2014], 
>>>         'reports': [4, 24, 31, 2, 3]}
>>> df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
>>> df

             name  year  reports
Cochice     Jason  2012        4
Pima        Molly  2012       24
Santa Cruz   Tina  2013       31
Maricopa     Jake  2014        2
Yuma          Amy  2014        3
>>> df.name.unique()

array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object)

In a similar fashion you can investigate the number of unique values per column as follows; you can use this to gain insight into the distribution of the data per column:

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df
   A  B
0  1  1
1  2  1
2  3  1

>>> df.nunique()
A    3
B    1
dtype: int64

You can also get the number of unique values per row by passing the axis argument:

>>> df.nunique(axis=1)

0    1
1    2
2    2
dtype: int64

You can gain even more insight into the distribution of the data in a column using the value_counts() function, which can also be used on an Index object. It returns an object containing counts of unique values, in descending order, so that the first element is the most frequently occurring one. NaN values are excluded by default.

>>> import numpy as np
>>> df = pd.DataFrame(data={"grades": [8, 7, 6, 5, 7, np.nan]})
>>> df.grades.value_counts()

7.0    2
8.0    1
6.0    1
5.0    1
Name: grades, dtype: int64

With dropna set to False we can also see NaN index values.

>>> df.grades.value_counts(dropna=False)

7.0    2
8.0    1
6.0    1
5.0    1
NaN    1
Name: grades, dtype: int64

Many functions built into Pandas have the option to drop NaNs or not; look up the function documentation to check (simply Google it).

Similarly to NumPy you can get all sorts of statistics about columns:

◦ df['column name'].min()

◦ df['column name'].max()

◦ df['column name'].mean()

◦ df['column name'].sum()

◦ df['column name'].count()
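
For instance, a quick sketch using the grades DataFrame from above (these statistics skip NaN values by default):

>>> df['grades'].mean()
6.6
>>> df['grades'].count()
5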

DataFrame Indexing and Selecting

This section covers: loc[], iloc[], where(), isin()

Indexing in Pandas works slightly differently than in NumPy with specific functions to access certain rows and columns. We mainly use the loc[] and iloc[] functionality. You will be using these a lot!

iloc

iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

This allows you to basically slice and index the table like you are used to from NumPy. By default iloc will index the rows, as you can see in the first example:

>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
>>>           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
>>>           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
>>> df = pd.DataFrame(mydict)
>>> df  
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

>>> df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: int64

You can use a list of indices to return multiple rows, or use slices, just as in NumPy:

>>> df.iloc[[0, 1]]

     a    b    c    d
0    1    2    3    4
1  100  200  300  400
>>> df.iloc[:3]

      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

You can also access specific rows and columns by using comma separated indices:

>>> df.iloc[1, :3]

a    100
b    200
c    300
Name: 1, dtype: int64
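
You can also select specific rows and columns at once by passing lists of positions for both axes:

>>> df.iloc[[0, 2], [0, 3]]

      a     d
0     1     4
2  1000  4000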

loc

You can use loc[] to access a group of rows and columns by label(s) or a boolean array.

Allowed inputs are:

  • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

  • A list or array of labels, e.g. ['a', 'b', 'c'].

  • A slice object with labels, e.g. 'a':'f'.

Warning

Note that contrary to usual Python slices, both the start and the stop are included.

  • A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

Some examples:

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
>>>                   index=['cobra', 'viper', 'sidewinder'],
>>>                   columns=['max_speed', 'shield'])
>>> df

            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note using [[]] returns a DataFrame.

>>> df.loc[['viper', 'sidewinder']]

            max_speed  shield
viper               4       5
sidewinder          7       8

Single label for row and column

>>> df.loc['cobra', 'shield']

2

Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included.

>>> df.loc['cobra':'viper', 'max_speed']

cobra    1
viper    4
Name: max_speed, dtype: int64

Similarly to NumPy arrays, you can use boolean indexing, which works great in combination with loc:

>>> df["max_speed"] > 2
cobra         False
viper          True
sidewinder     True
Name: max_speed, dtype: bool

>>> df.loc[df["max_speed"] > 2]
            max_speed  shield
viper               4       5
sidewinder          7       8
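
You can also combine a boolean condition for the rows with a column label in a single loc call:

>>> df.loc[df["max_speed"] > 2, "shield"]
viper         5
sidewinder    8
Name: shield, dtype: int64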

Other indexing functions

isin()

The isin() function checks whether each element in the DataFrame is contained in the given values.

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
>>>                   index=['eagle', 'cat'])
>>> df
       num_legs  num_wings
eagle         2          2
cat           4          0
                  
>>> df.isin([0, 2])
       num_legs  num_wings
eagle      True       True
cat       False       True

When values is a dict, we can pass values to check for each column separately:

>>> df.isin({'num_wings': [0, 3]})

       num_legs  num_wings
eagle     False      False
cat       False       True

where()

The Pandas where() method is used to check a DataFrame for one or more conditions and return the result accordingly. By default, the rows not satisfying the condition are filled with NaN values.

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.

>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

>>> s.where(s > 1, 10)
0    10
1    10
2    2
3    3
4    4
dtype: int64

Similarly there exists np.where!
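
A small sketch of the NumPy counterpart applied to the same Series; note that it returns a plain array rather than a Series:

>>> import numpy as np
>>> np.where(s > 1, s, 10)
array([10, 10,  2,  3,  4])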

Manipulating dataframes

This section covers: insert(), drop(), sort_values(), rename()

Adding columns

You are never stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example by adding columns together or by changing their units.

>>> data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
>>>         'Height': [5.1, 6.2, 5.1, 5.2],
>>>         'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
>>> df = pd.DataFrame(data)
>>> df

     Name  Height Qualification
0     Jai     5.1           Msc
1  Princi     6.2            MA
2  Gaurav     5.1           Msc
3    Anuj     5.2           Msc  

>>> address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
>>> df['Address'] = address
>>> df
     Name  Height Qualification    Address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna

Another way is by using DataFrame.insert().

It gives the freedom to add a column at any position we like and not just at the end. It also provides different options for inserting the column values.

>>> df.insert(2, "Age", [21, 23, 24, 21], allow_duplicates=True)
>>> df
     Name  Height  Age Qualification    Address
0     Jai     5.1   21           Msc      Delhi
1  Princi     6.2   23            MA  Bangalore
2  Gaurav     5.1   24           Msc    Chennai
3    Anuj     5.2   21           Msc      Patna
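
You can also derive a new column from an existing one, for example by changing its units. A small sketch, assuming the Height column is in feet (the new column name Height_cm is just for illustration):

>>> df['Height_cm'] = df['Height'] * 30.48   # feet to centimetres (assumed unit)
>>> df['Height_cm']
0    155.448
1    188.976
2    155.448
3    158.496
Name: Height_cm, dtype: float64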

drop()

The drop() function is used to drop specified labels from rows or columns.

Remove rows or columns by specifying label names and the corresponding axis, or by directly specifying index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
>>>                   columns=['P', 'Q', 'R', 'S'])
>>> df

   P  Q   R   S
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

>>> df.drop(['Q', 'R'], axis=1)
   P   S
0  0   3
1  4   7
2  8  11

>>> df.drop(columns=['Q', 'R'])
   P   S
0  0   3
1  4   7
2  8  11

By default rows are dropped:

>>> df.drop([0, 2])

   P  Q  R  S
1  4  5  6  7

rename()

One way of renaming the columns in a Pandas dataframe is by using the rename() function. This method is quite useful when we need to rename some selected columns because we need to specify information only for the columns which are to be renamed. You can also use it to rename the index.

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df
   A  B
0  1  4
1  2  5
2  3  6

>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6

>>> df.rename({1: 2, 2: 4}, axis='index')
   A  B
0  1  4
2  2  5
4  3  6

You can use rename to change the index from the default range from 0 to len(df) - 1 to a string as follows:

>>> df.index
RangeIndex(start=0, stop=3, step=1)

>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')

sort_values()

The Pandas sort_values() function sorts a DataFrame in ascending or descending order of a passed column. It differs from Python's built-in sorted() function, which cannot sort a DataFrame by a particular column.

>>> df = pd.DataFrame({
>>>     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
>>>     'col2': [2, 1, 9, 8, 7, 4],
>>>     'col3': [0, 1, 9, 4, 2, 3],
>>>     'col4': ['a', 'B', 'c', 'D', 'e', 'F']
>>> })
>>> df
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

>>> df.sort_values(by=['col1'])
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

Sort by multiple columns

>>> df.sort_values(by=['col1', 'col2'])

  col1  col2  col3 col4
1    A     1     1    B
0    A     2     0    a
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

You can use the ascending argument to select whether you want to sort in ascending or descending fashion. If you want to reset the index instead of keeping the original one, you can use the ignore_index argument.

>>> df.sort_values(by='col1', ascending=False, ignore_index=True)

  col1  col2  col3 col4
0    D     7     2    e
1    C     4     3    F
2    B     9     9    c
3    A     2     0    a
4    A     1     1    B
5  NaN     8     4    D

Many functions that potentially shuffle the index like this have the optional ignore_index argument. If the index is not set to specific labels but is simply equal to the row number, you might find it useful to set it to True!

Missing (NaN) Data

This section covers: fillna(), isna(), dropna()

It can often be the case that the data you receive contains missing values. Often you want to process these in a certain way because they are not useful for visualization or modelling. This section covers how to find and replace missing values in your DataFrame.

fillna()

This function can be used to fill all the NaNs in your dataframe with a specific value:

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
>>>                    [3, 4, np.nan, 1],
>>>                    [np.nan, np.nan, np.nan, 5],
>>>                    [np.nan, 3, np.nan, 4]],
>>>                    columns=list('ABCD'))
>>> df
 
   A    B   C    D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4 

>>> df.fillna(0)

    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4
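
The value passed to fillna() can also be a dictionary that maps column names to fill values, so each column gets its own replacement; columns that are not listed are left untouched:

>>> df.fillna({'A': 0, 'B': -1})

     A     B    C  D
0  0.0   2.0  NaN  0
1  3.0   4.0  NaN  1
2  0.0  -1.0  NaN  5
3  0.0   3.0  NaN  4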

isna()

You can use the isna() function to simply check for NaNs in your dataset. You can use this, for example, to count the NaNs per column:

>>> df
   A    B   C    D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4 

>>> df.isna()
       A      B     C      D
0   True  False  True  False
1  False  False  True  False
2   True   True  True  False
3   True  False  True  False

>>> df.isna().sum()
A    3
B    1
C    4
D    0
dtype: int64

dropna()

We have shown you how to find and replace missing values. In some instances, you might simply want to drop all the rows with missing values if you think the data is unusable. This is how:

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
>>>                     "toy": [np.nan, 'Batmobile', 'Bullwhip'],
>>>                     "born": [pd.NaT, pd.Timestamp("1940-04-25"),
>>>                              pd.NaT]})
>>> df

       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()

     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')

       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')

       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Combining DataFrames

  • merge() for combining data on common columns or indices

  • .join() for combining data on a key column or an index

  • concat() for combining DataFrames across rows or columns

concat()

You can very simply concatenate Pandas objects along a particular axis, with optional set logic along the other axes, just like you are used to from NumPy.

Here are two DataFrame objects with identical columns.

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
>>>                    columns=['letter', 'number'])
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
>>>                    columns=['letter', 'number'])

>>> df1
  letter  number
0      a       1
1      b       2

>>> df2
  letter  number
0      c       3
1      d       4

>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.

>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
>>>                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog

>>> pd.concat([df1, df3], sort=False)
  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog

Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

>>> pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame objects horizontally along the x axis by passing in axis=1.

>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
>>>                    columns=['animal', 'name'])
>>>
>>> df4                    
   animal    name
0    bird   polly
1  monkey  george

>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george

merge()

Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise, if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

When you want to combine data objects based on one or more keys in a similar way to a relational database, merge() is the tool you need. More specifically, merge() is most useful when you want to combine rows that share data.

You can achieve both many-to-one and many-to-many joins with merge(). In a many-to-one join, one of your datasets will have many rows in the merge column that repeat the same values (such as 1, 1, 3, 5, 5), while the merge column in the other dataset will not have repeat values (such as 1, 3, 5).

You can specify what kind of join you want merge to perform, inner, outer, left or right. By default it performs an inner join.

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
                    
>>> df1
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

>>> df2
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> df1.merge(df2, left_on='lkey', right_on='rkey')

  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
>>>          suffixes=('_left', '_right'))          
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

You can also merge on a single column that both dataframes share:

>>> df1 = pd.DataFrame({'PostalCode': ['10001', '10002', '10003'],
...                    'City': ['New York', 'Los Angeles', 'Chicago']})
>>> df2 = pd.DataFrame({'PostalCode': ['10002', '10003', '10004'],
...                    'Population': [100000, 200000, 300000]})
>>> pd.merge(df1, df2, on='PostalCode', how='inner')

  PostalCode         City  Population
0      10002  Los Angeles      100000
1      10003      Chicago      200000

>>> pd.merge(df1, df2, on='PostalCode', how='left')
  PostalCode         City  Population
0      10001     New York         NaN
1      10002  Los Angeles    100000.0
2      10003      Chicago    200000.0

Notice the difference between the left and inner joins here.

pd.merge is one of those functions for which it is worth checking out the Pandas docs for more optional function arguments!

join()

Join is a more specific case of merge. Even though merge is the underlying process of join and can in principle always be used, in the specific case where you want to do a left-join and use the index of the second (right) dataframe to merge, you can use join. In general, it is index-preserving. Note the following code example:

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
>>>                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>>                    
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
>>>                       'B': ['B0', 'B1', 'B2']})
>>> df                   
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

Note that these two operations do exactly the same thing:

left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)

Aggregation in Pandas

The Pandas DataFrame.groupby() function is used to split the data into groups based on some criteria; Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
>>>                               'Parrot', 'Parrot'],
>>>                    'Max Speed': [380., 370., 24., 26.]})
>>> df

   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

We can also choose whether to include NaN in the group keys by setting the dropna parameter; the default setting is True (NaN keys are dropped):

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

>>> df.groupby("b").sum()
    a   c
b
1.0 2   3
2.0 2   5

>>> df.groupby(by=["b"], dropna=False).sum()
    a   c
b
1.0 2   3
2.0 2   5
NaN 1   4
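
Instead of a single function such as mean() or sum(), you can also apply several aggregations at once with agg(). A short sketch reusing the Animal data from the start of this section:

>>> animals = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...                         'Max Speed': [380., 370., 24., 26.]})
>>> animals.groupby('Animal')['Max Speed'].agg(['mean', 'max'])

         mean    max
Animal
Falcon  375.0  380.0
Parrot   25.0   26.0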
