🍃MongoDB

MongoDB

In this tutorial, you'll learn MongoDB and how we use Python with MongoDB. Before starting the tutorial, let's remember the difference between structured and unstructured data and why and in which situations we use MongoDB.

Prerequisites

  1. We recommend that you set up a MongoDB Atlas free tier cluster for this tutorial. You can follow this tutorial to get your free MongoDB Atlas Cluster.

  2. Install MongoDB Compass and connect to the cluster you have created. You can follow this guide.

Structured vs Unstructured Data

Structured data is most often categorized as quantitative data, and it's the type of data most of us are used to working with. Think of data that fits neatly within fixed fields and columns in relational databases and spreadsheets. Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation, and more. Structured data is highly organized and easily understood by machine language. Those working within relational databases can input, search, and manipulate structured data relatively quickly using a relational database management system (RDBMS). This is the most attractive feature of structured data. The programming language used for managing structured data is called structured query language, also known as SQL.

Unstructured data is most often categorized as qualitative data, and it cannot be processed and analyzed using conventional data tools and methods. Examples of unstructured data include text, video files, audio files, mobile activity, social media posts, satellite imagery, surveillance imagery – the list goes on and on. Unstructured data is difficult to deconstruct because it has no predefined data model, meaning it cannot be organized in relational databases. Instead, non-relational or NoSQL databases are the best fit for managing unstructured data. Another way to manage unstructured data is to have it flow into a data lake, allowing it to be in its raw, unstructured format.

The Challenge with Structured Databases

We are generating data at an unprecedented pace right now. The scale and size of this data – it’s mind-boggling! Just check out these numbers:

  • Facebook generates four petabytes of data in just one day

  • Google generates twenty petabytes of data every day

  • Furthermore, Large Hadron Collider (27 kilometers long most powerful particle accelerator of the world) generates one petabyte of data per second. Most importantly this data is unstructured

  • 1 Petabyte(PB) = 220 Gigabyte(GB)

Can you imagine using SQL to work with this volume of data? It’s setting yourself up for a nightmare!

SQL is a wonderful language to learn as a data scientist and it does work well when we’re dealing with structured data. But if your organization works with unstructured data, SQL databases can not fulfill the requirements. Structured databases have two major disadvantages:

  • Scalability: It is very difficult to scale as the database grows larger

  • Elasticity: Structured databases need data in a predefined format. If the data is not following the predefined format, relational databases do not store it

So how do we solve this issue? If not SQL then what?

This is where we go for unstructured databases. Among a wide range of such databases, MongoDB is widely used because of its rich query language and quick access with concepts like indexing. In short, MongoDB is best suited for managing big data. Let’s see the difference between structured and unstructured databases:

What is MongoDB?

MongoDB is an open-source document-based database system. Initially released 12 years ago, MongoDB was one of the first No-SQL system on the market, and therefore quickly gained the interest of developers. It is the most popular document database system currently on the market.

Useful links and free MongoDB courses:

MongoDB Atlas

MongoDB Atlas is a fully managed cloud database developed by the same people that build MongoDB. Atlas handles all the complexity of deploying, managing, and healing your deployments on the cloud service provider of your choice (AWS, Azure, and GCP).

MongoDB Compass

MongoDB Compass is a powerful GUI for querying, aggregating, and analyzing your MongoDB data in a visual environment. Compass is free to use and source available, and can be run on macOS, Windows, and Linux.

The Architecture of a MongoDB Database

The information in MongoDB is stored in documents. Here, a document is analogous to rows in structured databases.

  • Each document is a collection of key-value pairs

  • Each key-value pair is called a field

  • Every document has an _id field, which uniquely identifies the documents

  • A document may also contain nested documents

  • Documents may have a varying number of fields (they can be blank as well)

These documents are stored in a collection. A collection is literally a collection of documents in MongoDB. This is analogous to tables in traditional databases.

Unlike traditional databases, the data is generally stored in a single collection in MongoDB, so there is no concept of joins (except $lookup operator, which performs left-outer-join like operation). MongoDB has the nested document instead.

Using Python with MongoDB

PyMongo is the official MongoDB Python driver for MongoDB. We recommend you use this driver to work with MongoDB from Python. It is easy to use and offers an intuitive API for accessing databases, collections, and documents.

Objects retrieved from MongoDB through PyMongo are compatible with dictionaries and lists, so we can easily manipulate, iterate, and print them

How MongoDB stores data

MongoDB stores data in JSON-like documents:

# Mongodb document (JSON-style)
document_1 = {
"_id" : "BF00001CFOOD",
"item_name" : "Bread",
"quantity" : 2,
"ingredients" : "all-purpose flour"
}

Python dictionaries look like:

# python dictionary
dict_1 = {
"item_name" : "blender",
"max_discount" : "10%",
"batch_number" : "RR450020FRG",
"price" : 340
}

Read on for an overview of how to get started and deliver on the potential of this powerful combination.

Connecting Python and MongoDB Atlas

PyMongo has a set of packages for Python MongoDB interaction. To install PyMongo, open the command line and type:

python -m pip install pymongo

Tip: If you are getting “ModuleNotFoundError: No module named 'pymongo'” error, uninstall pymongo. Use pip uninstall pymongo command. Then, re-install using the installation command.

For this python mongodb tutorial, we use mongodb srv URI, so let’s install dnspython:

python -m pip install dnspython

Now, we can use pymongo as a python mongodb library in our code with an import statement.

Creating a MongoDB database in Python

The first step to connect python to Atlas is MongoDB cluster setup. Next, create a file named pymongo_test_insert.py in any folder to write pymongo code. Create the mongodb client by adding the following:

from pymongo import MongoClient

def get_database():
    # Provide the mongodb atlas url to connect python to mongodb using pymongo
    CONNECTION_STRING = "mongodb+srv://<username>:<password>@<cluster-name>.mongodb.net/myFirstDatabase"

    # Create a connection using MongoClient. You can import MongoClient or use pymongo.MongoClient
    client = MongoClient(CONNECTION_STRING)

    # Create the database for our example 
    db = client['user_shopping_list']
    return db
    
# This is added so that many files can reuse the function get_database()
if __name__ == "__main__":    
    # Get the database
    db = get_database()

Use the connection_string to create the mongoclient and get the MongoDB database connection. Change the username, password and cluster name. In this python mongodb tutorial, we will create a shopping list and add a few items. For this, we created a database 'user_shopping_list'.

Note: MongoDB doesn’t create a database until you have collections and documents in it. So, let’s create a collection next.

Creating a collection in Python

To create a collection, pass the collection name to the database. Make sure to have the right indentation while copying the code to your .py file.

collection = db["user_1_items"]

This creates a collection named user_1_items in the user_shopping_list database.

The list_database_names command shows the names of all the available datasets:

db.client.list_database_names()

Output:

['sample_airbnb',
 'sample_analytics',
 'sample_geospatial',
 'sample_mflix',
 'sample_restaurants',
 'sample_supplies',
 'sample_training',
 'sample_weatherdata',
 'user_shopping_list',
 'admin',
 'local']

The list_collection_names command shows the names of all the available collections:

db.list_collection_names()

Output:

['user_1_items']

Note: In MongoDB, a collection is not created until it gets content, so if this is your first time creating a collection, you should complete the next part (inserting documents) before you check if the database and collection exist!

Inserting documents in Python

For inserting many documents at once, use the pymongo insert_many() method.

item_1 = {
"_id" : "U1IT00001",
"item_name" : "Blender",
"max_discount" : "10%",
"batch_number" : "RR450020FRG",
"price" : 340,
"category" : "kitchen appliance"
}

item_2 = {
"_id" : "U1IT00002",
"item_name" : "Egg",
"category" : "food",
"quantity" : 12,
"price" : 36,
"item_description" : "brown country eggs"
}

collection.insert_many([item_1,item_2])

Let us insert a third document without specifying the _id field. This time we add a field of data type ‘date’. To add date using pymongo, use the python dateutil package. ISODate will not work with Python, as it is a Mongo shell function.

Install the package using the following command: python -m pip install python-dateutil

Add the following to pymongo_test_insert.py:

from dateutil import parser
expiry_date = '2021-07-13T00:00:00.000Z'
expiry = parser.parse(expiry_date)
item_3 = {
"item_name" : "Bread",
"quantity" : 2,
"ingredients" : "all-purpose flour",
"expiry_date" : expiry
}
collection.insert_one(item_3)

We use the insert_one() method to insert a single document.

Open the command line and navigate to the folder where you have saved pymongo_test_insert.py. Execute the file using the python pymongo_test_insert.py command.

Let’s connect to MongoDB Atlas UI and check what we have so far. Login to your Atlas cluster and click on the collections button. On the left side, you can see the database and collection name that we created. If you click on the collection name, you can view the data as well:

The _id field is of ObjectId type by default. If we don’t specify the _id field, MongoDB generates the same. Not all fields present in one document are present in others. But MongoDB doesn’t stop you from entering data - this is the essence of a schemaless database.

If we insert item_3 again, mongodb will insert a new document, with a new _id value. But, the first two inserts will throw an error because of the _id field, the unique identifier.

The count_documents command shows the number of documents available in “user_1_items” collection and outputs 3.

collection.count_documents({}) 

Querying in Python

Let’s view all the documents together using find(). For that we will create a separate file pymongo_test_query.py:

# Get the database using the method we defined in pymongo_test_insert file 
from pymongo_test_insert import get_database
db = get_database()

# Get collection
collection = db["user_1_items"]

item_details = collection.find()
for item in item_details:
    # This does not give a very readable output
    print(item)

Open the command line and navigate to the folder where you have saved pymongo_test_query.py. Execute the file using the python pymongo_test_query.py command.

We get the list of dictionary object as the output:

{'_id': 'U1IT00001', 'item_name': 'Blender', 'max_discount': '10%', 'batch_number': 'RR450020FRG', 'price': 340, 'category': 'kitchen appliance'}
{'_id': 'U1IT00002', 'item_name': 'Egg', 'category': 'food', 'quantity': 12, 'price': 36, 'item_description': 'brown country eggs'}
{'_id': ObjectId('613cf8c50d1f5ded51ff1c78'), 'item_name': 'Bread', 'quantity': 2, 'ingredients': 'all-purpose flour', 'expiry_date': datetime.datetime(2021, 7, 13, 0, 0)}

The function has returned a dictionary. Let’s see the keys of this dictionary using keys():

for item in item_details:
    print(item.keys())

Output:

dict_keys(['_id', 'item_name', 'max_discount', 'batch_number', 'price', 'category'])
dict_keys(['_id', 'item_name', 'category', 'quantity', 'price', 'item_description'])
dict_keys(['_id', 'item_name', 'quantity', 'ingredients', 'expiry_date'])

We can view the data but the format is not all that great. So, let’s print the item names and their category:

print(item['item_name'], item['category'])

Although Mongodb gets the entire data, we get a python ‘KeyError’ on the third document.

To handle missing data errors in python, use pandas.DataFrames. DataFrames are 2D data structures used for data processing tasks. Pymongo find() method returns dictionary objects which can be converted into a dataframe in a single line of code.

Install pandas library as:

python -m pip install pandas

Replace the for loop with the following code to handle KeyError in one step:

from pandas import DataFrame
# convert the dictionary objects to dataframe
items_df = DataFrame(item_details)

# see the magic
print(items_df)

And don’t forget to comment the print(item['item_name'], item['category'])

The errors are replaced by NaN and NaT for the missing values.

Filter Conditions

We have used find to fetch data from MongoDB. We can also use the find_one function to fetch the documents. find_one fetches only one document at a time. On the other hand, find can fetch multiple documents from the MongoDB collection. But, we don’t need to fetch all the documents all the time. This is where we apply some filter conditions.

Previously we have inserted a document to the MongoDB collection with the item_name field as Egg. Let’s see how to fetch that MongoDB document using the filter condition:

collection.find_one({"item_name":'Egg'})

Output:

{'_id': 'U1IT00002',
 'item_name': 'Egg',
 'category': 'food',
 'quantity': 12,
 'price': 36,
 'item_description': 'brown country eggs'}

Find One

To select data from a collection in MongoDB, we can use the find_one() method. The find_one() method returns the first occurrence in the selection.

Find All

To select data from a table in MongoDB, we can also use the find() method. The find() method returns all occurrences in the selection.

The first parameter of the find() method is a query object. If we use an empty query object, it selects all documents in the collection.

  • Note ⇒ No parameters in the find() method gives you the same result as SELECT * in MySQL.

Return Only Some Fields

The second parameter of the find() method is an object describing which fields to include in the result. This parameter is optional, and if omitted, all fields will be included in the result.

Example: collection.find({},{ "_id": 0, "name": 1, "address": 1 })

Deletion

The delete_one function deletes a single document from the MongoDB collection. Previously we had inserted the document for an item named Bread. Let’s have a look at the MongoDB document inserted:

collection.find_one({'item_name':'Bread'})

Output:

{'_id': ObjectId('613cf8c50d1f5ded51ff1c78'),
 'item_name': 'Bread',
 'quantity': 2,
 'ingredients': 'all-purpose flour',
 'expiry_date': datetime.datetime(2021, 7, 13, 0, 0)}

We will now delete this MongoDB document:

collection.delete_one({'item_name':'Bread'})

<pymongo.results.DeleteResult at 0x1febb0812c8>

Let’s try to fetch this document after deletion. If this document is not available in our MongoDB collection, the find_one function will return nothing.

collection.find_one({'item_name':'Bread'})

Output: Nothing is returned.

Since we get nothing in return, it means that the MongoDB document doesn’t exist anymore.

As we saw that the insert_many function is used to insert multiple documents in MongoDB collection, delete_many is used to delete multiple documents at once. Let’s try to delete two MongoDB documents by the name field:

myquery = ({ "item_name":{"$in": ["Blender","Egg"]}})
x = collection.delete_many(myquery)
print(x.deleted_count, " documents deleted.")

Output: 2 documents deleted.

Here, the deleted count stores the number of deleted MongoDB documents during the operation. The ‘$in’ is an operator in MongoDB.

We have discussed all the basic operations with examples so far. We also understood several theoretical concepts of MongoDB.

Let me share a couple of useful functions of PyMongo:

  • sort: The purpose of this function is to sort the documents

  • limit: This function limits the number of MongoDB documents fetched by the find function

There are more MongoDB functions you can check out here.

Update Collection

You can update a record, or document as it is called in MongoDB, by using the update_one() method. The first parameter of the update_one() method is a query object defining which document to update.

Note: If the query finds more than one record, only the first occurrence is updated.

The second parameter is an object defining the new values of the document.

Example: Change the price from "340" to "360":

The collection before update is:

{'_id': 'U1IT00001', 'item_name': 'Blender', 'max_discount': '10%', 'batch_number': 'RR450020FRG', 'price': 340, 'category': 'kitchen appliance'}
{'_id': 'U1IT00002', 'item_name': 'Egg', 'category': 'food', 'quantity': 12, 'price': 36, 'item_description': 'brown country eggs'}

Update operation:

# Get the database using the method we defined in pymongo_test_insert file 
from pymongo_test_insert import get_database

db = get_database()

collection = db["user_1_items"]

myquery = {"price": 340}
newvalues = {"$set":{ "price": 360}}

collection.update_one(myquery, newvalues)

# print collection after the update:
for item in collection.find():
  print(item)

The collection after update is:

{'_id': 'U1IT00001', 'item_name': 'Blender', 'max_discount': '10%', 'batch_number': 'RR450020FRG', 'price': 360, 'category': 'kitchen appliance'}
{'_id': 'U1IT00002', 'item_name': 'Egg', 'category': 'food', 'quantity': 12, 'price': 36, 'item_description': 'brown country eggs'}

Update Many

To update all documents that meets the criteria of the query, use the update_many() method.

Example: Update quantity of all documents where the id starts with U1IT

Update operation:

# Get the database using the method we defined in pymongo_test_insert file 
from pymongo_test_insert import get_database

db = get_database()

collection = db["user_1_items"]
 
myquery = { "_id": { "$regex": "^U1IT" } }
newvalues = { "$set": { "quantity": 5 } }
 
x = collection.update_many(myquery, newvalues)
 
print(x.modified_count, "documents updated.")

Output:

2 documents updated

The collection after update is:

{'_id': 'U1IT00001', 'item_name': 'Blender', 'max_discount': '10%', 'batch_number': 'RR450020FRG', 'price': 360, 'category': 'kitchen appliance', 'quantity': 5}
{'_id': 'U1IT00002', 'item_name': 'Egg', 'category': 'food', 'quantity': 5, 'price': 36, 'item_description': 'brown country eggs'}

Indexing in Python MongoDB

The number of documents and collections in a real-world database always keeps increasing. It can take a very long time to search for specific documents -- for example, documents that have "all-purpose flour" among their ingredients -- in a very large collection. Indexes make database search faster, efficient, and reduce the cost of querying. For example, sort, count, match, etc.

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. This scan is highly inefficient and require MongoDB to process a large volume of data. If an appropriate index exists for a query, the database can use the index to limit the number of documents it must inspect.

MongoDB offers a broad range of index types and features with language-specific sort orders to support complex access patterns to your data. MongoDB indexes can be created and dropped on-demand to accommodate evolving application requirements and query patterns and can be declared on any field within your documents, including fields nested within arrays.

MongoDB defines indexes at the collection level. Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

You can watch the following tutorial to learn how indexing works in MongoDB Compass.

Different types of indexes for a MongoDB collection

There are many different types of indexes in MongoDB that can be used for indexing a collection. Let’s take a brief look at each one:

  • Single-Field Index: As the name implies, this is an index on a single field of a document. For this type of index, the sort order doesn’t affect how efficiently documents are queried. It can be traversed in either descending or ascending order.

  • Compound Index: This type of index can be created on multiple fields. For a compound index, the order that the fields are listed impacts sorting.

  • Multikey Index: This type of index is used to index data that has been stored in arrays.

  • Hashed Indexes: This kind of index is used for large collections that are sharded and spread out over several machines. The hash of the value of a field is indexed instead of the value itself.

Create a simple, single-field index in PyMongo

In the following example, we’ll show a basic API call to the create_index() method. All you need to pass is a string that represents the index name as the parameter:

collection.create_index("an_index")

If you were to print out the results of this API call, it would return a string containing the index name followed by an underscore, and the integer “1” (_1). This indicates that the index will be sorted in the default ascending order.

Create an index for a MongoDB collection, and specify the order

When you use PyMongo to create an index, you can pair the index name with either a 1 or a -1 integer value to explicitly set the sort order to ascending or descending, respectively.

The syntax for this API call is a bit different from the Mongo Shell. The PyMongo API call to create_index() requires that you pass the parameters as a tuple object nested inside of a list. An example of this syntax is shown below:

# create an index in descending order, and return a result
resp = collection.create_index([ ("field_to_index", -1) ])
print ("index response:", resp)

Like the last example, the response returned by the method will be a string containing the field name that was indexed, followed by an underscore (_) and a -1 because we specified that the collection would be indexed in a descending order.

Use DESCENDING or ASCENDING when indexing MongoDB collections instead of -1 and 1 respectively

In our next example, we’ll make the create_index() call more readable by passing the pymongo.DESCENDING or pymongo.ASCENDING attributes instead of a -1 or 1. To do this, we need to make sure to either import all of the classes and attributes of the pymongo library:

from pymongo import *

Another option is to explicitly import them by name when we import the MongoClient class:

from pymongo import MongoClient, ASCENDING, DESCENDING

Create a compound index in Python using more than one field

You can index multiple fields in a collection by passing a Python list like we did in an earlier example; however, this time we’ll add more tuple pairs to the list, resulting in a compound index:

resp = col.create_index(
[
  ("field_to_index", 1),
  ("second_field_indexed", -1)
])
print ("index response:", resp)

MongoDB's aggregation pipelines

All of the pipelines in this post will be executed against the sample_mflix database's movies collection. It contains documents that look like this:

{
   '_id': ObjectId('573a1392f29313caabcdb497'),
   'awards': {'nominations': 7,
               'text': 'Won 1 Oscar. Another 2 wins & 7 nominations.',
               'wins': 3},
   'cast': ['Janet Gaynor', 'Fredric March', 'Adolphe Menjou', 'May Robson'],
   'countries': ['USA'],
   'directors': ['William A. Wellman', 'Jack Conway'],
   'fullplot': 'Esther Blodgett is just another starry-eyed farm kid trying to '
               'break into the movies. Waitressing at a Hollywood party, she '
               'catches the eye of alcoholic star Norman Maine, is given a test, '
               'and is caught up in the Hollywood glamor machine (ruthlessly '
               'satirized). She and her idol Norman marry; but his career '
               'abruptly dwindles to nothing',
   'genres': ['Drama'],
   'imdb': {'id': 29606, 'rating': 7.7, 'votes': 5005},
   'languages': ['English'],
   'lastupdated': '2015-09-01 00:55:54.333000000',
   'plot': 'A young woman comes to Hollywood with dreams of stardom, but '
            'achieves them only with the help of an alcoholic leading man whose '
            'best days are behind him.',
   'poster': 'https://m.media-amazon.com/images/M/MV5BMmE5ODI0NzMtYjc5Yy00MzMzLTk5OTQtN2Q3MzgwOTllMTY3XkEyXkFqcGdeQXVyNjc0MzMzNjA@._V1_SY1000_SX677_AL_.jpg',
   'rated': 'NOT RATED',
   'released': datetime.datetime(1937, 4, 27, 0, 0),
   'runtime': 111,
   'title': 'A Star Is Born',
   'tomatoes': {'critic': {'meter': 100, 'numReviews': 11, 'rating': 7.4},
               'dvd': datetime.datetime(2004, 11, 16, 0, 0),
               'fresh': 11,
               'lastUpdated': datetime.datetime(2015, 8, 26, 18, 58, 34),
               'production': 'Image Entertainment Inc.',
               'rotten': 0,
               'viewer': {'meter': 79, 'numReviews': 2526, 'rating': 3.6},
               'website': 'http://www.vcientertainment.com/Film-Categories?product_id=73'},
   'type': 'movie',
   'writers': ['Dorothy Parker (screen play)',
               'Alan Campbell (screen play)',
               'Robert Carson (screen play)',
               'William A. Wellman (from a story by)',
               'Robert Carson (from a story by)'],
   'year': 1937}

There's a lot of data there, but we'll be focusing mainly on the _id, title, year, and cast fields.

Your First Aggregation Pipeline

Aggregation pipelines are executed by PyMongo using Collection's aggregate() method.

The first argument to aggregate() is a sequence of pipeline stages to be executed. Much like a query, each stage of an aggregation pipeline is a BSON document, and PyMongo will automatically convert a dict into a BSON document for you.

An aggregation pipeline operates on all of the data in a collection. Each stage in the pipeline is applied to the documents passing through, and whatever documents are emitted from one stage are passed as input to the next stage, until there are no more stages left. At this point, the documents emitted from the last stage in the pipeline are returned to the client program, in a similar way to a call to find().

Individual stages, such as $match, can act as a filter, to only pass through documents matching certain criteria. Other stage types, such as $project, $addFields, and $lookup will modify the content of individual documents as they pass through the pipeline. Finally, certain stage types, such as $group, will create an entirely new set of documents based on the documents passed into it taken as a whole. None of these stages change the data that is stored in MongoDB itself. They just change the data before returning it to your program! There is a stage, $set, which can save the results of a pipeline back into MongoDB, but we won’t cover it.

from pymongo import MongoClient, ASCENDING

# Provide the mongodb atlas url to connect python to mongodb using pymongo
CONNECTION_STRING = "mongodb+srv://<username>:<password>@<cluster-name>.mongodb.net/myFirstDatabase"

# Create a connection using MongoClient. You can import MongoClient or use pymongo.MongoClient
client = MongoClient(CONNECTION_STRING)

# Get a reference to the "sample_mflix" database:
db = client["sample_mflix"]
# Get a reference to the "movies" collection:
movie_collection = db["movies"]

The above code will provide a global variable, a Collection object called movie_collection, which points to the movies collection in your database.

Here is some code which creates a pipeline, executes it with aggregate, and then loops through and prints the details of each movie in the results. Paste it into your program.

pipeline = [
   {
      "$match": {
         "title": "A Star Is Born"
      }
   },
   {
      "$sort": {
         "year": ASCENDING
      }
   },
]
results = movie_collection.aggregate(pipeline)
for movie in results:
   print(" * {title}, {first_castmember}, {year}".format(
         title=movie["title"],
         first_castmember=movie["cast"][0],
         year=movie["year"],
   ))

This pipeline has two stages. The first is a $match stage, which is similar to querying a collection with find(). It filters the documents passing through the stage based on a query. Because it's the first stage in the pipeline, its input is all of the documents in the movie collection. The query for the $match stage filters on the title field of the input documents, so the only documents that will be output from this stage will have a title of "A Star Is Born."

The second stage is a $sort stage. Only the documents for the movie "A Star Is Born" are passed to this stage, so the result will be all of the movies called "A Star Is Born," now sorted by their year field, with the oldest movie first.

Calls to aggregate() return a cursor pointing to the resulting documents. The cursor can be looped through like any other sequence. The code above loops through all of the returned documents and prints a short summary, consisting of the title, the first actor in the cast array, and the year the movie was produced.

Executing the code above results in:

* A Star Is Born, Janet Gaynor, 1937
* A Star Is Born, Judy Garland, 1954
* A Star Is Born, Barbra Streisand, 1976

Refactoring the Code

It is possible to build up whole aggregation pipelines as a single data structure, as in the example above, but it's not necessarily a good idea. Pipelines can get long and complex. For this reason, I recommend you build up each stage of your pipeline as a separate variable, and then combine the stages into a pipeline at the end, like this:

# Match title = "A Star Is Born":
stage_match_title = {
   "$match": {
         "title": "A Star Is Born"
   }
}

# Sort by year, ascending:
stage_sort_year_ascending = {
   "$sort": { "year": ASCENDING }
}

# Now the pipeline is easier to read:
pipeline = [
   stage_match_title,
   stage_sort_year_ascending,
]

Limit the Number of Results

Imagine I wanted to obtain the most recent production of "A Star Is Born" from the movies collection. This can be thought of as three stages, executed in order:

  1. Obtain the movie documents for "A Star Is Born."

  2. Sort by year, descending.

  3. Discard all but the first document.

The first stage is already the same as stage_match_title above. The second stage is the same as stage_sort_year_ascending, but with ASCENDING changed to DESCENDING. The third stage is a $limit stage. The modified and new code looks like this:

# Sort by year, descending:
stage_sort_year_descending = {
   "$sort": { "year": DESCENDING }
}

# Limit to 1 document:
stage_limit_1 = { "$limit": 1 }

pipeline = [
   stage_match_title,
   stage_sort_year_descending,
   stage_limit_1,
]

If you make the changes above and execute your code, then you should see just the following line:

* A Star Is Born, Barbra Streisand, 1976

Now you know how to filter, sort, and limit the contents of a collection using an aggregation pipeline. But these are just operations you can already do with find()! Why would you want to use these complex, new-fangled aggregation pipelines? Read on, to see the true power of MongoDB aggregation pipelines.

Look Up Related Data in Other Collections

There's also a collection called comments in the sample_mflix database. Documents in the comments collection look like this:

{
   '_id': ObjectId('5a9427648b0beebeb69579d3'),
   'movie_id': ObjectId('573a1390f29313caabcd4217'),
   'date': datetime.datetime(1983, 4, 27, 20, 39, 15),
   'email': 'cameron_duran@fakegmail.com',
   'name': 'Cameron Duran',
   'text': 'Quasi dicta culpa asperiores quaerat perferendis neque. Est animi '
            'pariatur impedit itaque exercitationem.'
}

It's a comment for a movie. The second field, movie_id, corresponds to the _id value of a document in the movies collection.

So, it's a comment related to a movie!

Does MongoDB enable you to query movies and embed the related comments, like a JOIN in a relational database? Yes, it does! With the $lookup stage.

I'll show you how to obtain related documents from another collection, and embed them in the documents from your primary collection. First, create a new pipeline from scratch, and start with the following:

# Look up related documents in the 'comments' collection:
stage_lookup_comments = {
   "$lookup": {
         "from": "comments",
         "localField": "_id",
         "foreignField": "movie_id",
         "as": "related_comments",
   }
}

# Limit to the first 5 documents:
stage_limit_5 = { "$limit": 5 }

pipeline = [
   stage_lookup_comments,
   stage_limit_5,
]

results = movie_collection.aggregate(pipeline)
for movie in results:
   print(movie)

The stage called stage_lookup_comments is a $lookup stage. This $lookup stage will look up documents from the comments collection that have the same movie id. The matching comments will be listed as an array in a field named 'related_comments,' with an array value containing all of the comments that have this movie's '_id' value as 'movie_id.' Added a $limit stage just to ensure that there's a reasonable amount of output.

Now, execute the code.

The aggregation pipeline above will print out all of the contents of five movie documents. It's quite a lot of data, but if you look carefully, you should see that there's a new field in each document that looks like this:

'related_comments': []

Matching on Array Length

If you're lucky, you may have some documents in the array, but it's unlikely, as most of the movies have no comments. Now, we’ll add some stages to match only movies which have more than two comments.

Ideally, you'd be able to add a single $match stage which obtained the length of the related_comments field and matched it against the expression { "$gt": 2 }. In this case, it's actually two steps:

  • Add a field (comment_count) containing the length of the related_comments field.

  • Match where the value of comment_count is greater than two.

Here is the code for the two stages:

# Calculate the number of comments for each movie:
stage_add_comment_count = {
   "$addFields": {
         "comment_count": {
            "$size": "$related_comments"
         }
   }
}

# Match movie documents with more than 2 comments:
stage_match_with_comments = {
   "$match": {
         "comment_count": {
            "$gt": 2
         }
   }
}

The two stages go after the $lookup stage, and before the $limit 5 stage:

pipeline = [
   stage_lookup_comments,
   stage_add_comment_count,
   stage_match_with_comments,
   stage_limit_5,
]

Now when you run the following code, you will get an output similar to:

results = movie_collection.aggregate(pipeline)
for movie in results:
   print(movie["title"])
   print("Comment count:", movie["comment_count"])

   # Loop through the first 5 comments and print the name and text:
   for comment in movie["related_comments"][:5]:
         print(" * {name}: {text}".format(
            name=comment["name"],
            text=comment["text"]))

Output:

Footsteps in the Fog
--------------------
Comment count: 3
* Sansa Stark: Error ex culpa dignissimos assumenda voluptates vel. Qui inventore quae quod facere veniam quaerat quibusdam. Accusamus ab deleniti placeat non.
* Theon Greyjoy: Animi dolor minima culpa sequi voluptate. Possimus necessitatibus voluptatem hic cum numquam voluptates.
* Donna Smith: Et esse nulla ducimus tempore aliquid. Suscipit iste dignissimos voluptate velit. Laboriosam sequi quae fugiat similique alias. Corporis cumque labore veniam dignissimos.

Now you learned how to work with lookups in your pipelines, Now you’ll learn how to use the $group stage to do actual aggregation.

Grouping Documents with $group

The $group stage is one of the more difficult stages to understand, so we'll break this down slowly. Start with the following code:

from pprint import pprint

# Group movies by year, producing 'year-summary' documents that look like:
# {
#     '_id': 1917,
# }
stage_group_year = {
   "$group": {
         "_id": "$year",
   }
}

pipeline = [
   stage_group_year,
]
results = movie_collection.aggregate(pipeline)

# Loop through the 'year-summary' documents:
for year_summary in results:
   pprint(year_summary)

Execute this code, and you should see something like this:

{'_id': 1978}
{'_id': 1996}
{'_id': 1931}
{'_id': '2000è'}
{'_id': 1960}
{'_id': 1972}
{'_id': 1943}
{'_id': '1997è'}
{'_id': 2010}
{'_id': 2004}
{'_id': 1947}
{'_id': '1987è'}
{'_id': 1954}
...

Each line is a document emitted from the aggregation pipeline. But you're not looking at movie documents anymore. The $group stage groups input documents by the specified _id expression and output one document for each unique _id value. In this case, the expression is $year, which means one document will be emitted for each unique value of the year field. Each document emitted can (and usually will) also contain values generated from aggregating data from the grouped documents.

Change the stage definition to the following:

stage_group_year = {
   "$group": {
         "_id": "$year",
         # Count the number of movies in the group:
         "movie_count": { "$sum": 1 },
   }
}

This will add a movie_count field, containing the result of adding 1 for every document in the group. In other words, it counts the number of movie documents in the group. If you execute the code now, you should see something like the following:

{'_id': '1997è', 'movie_count': 2}
{'_id': 2010, 'movie_count': 970}
{'_id': 1947, 'movie_count': 38}
{'_id': '1987è', 'movie_count': 1}
{'_id': 2012, 'movie_count': 1109}
{'_id': 1954, 'movie_count': 64}
...

There are a number of accumulator operators, like $sum, that allow you to summarize data from the group. If you wanted to build an array of all the movie titles in the emitted document, you could add "movie_titles": { "$push": "$title" }, to the $group stage. In that case, you would get documents that look like this:

{
   '_id': 1917,
   'movie_count': 3,
   'movie_titles': [
      'The Poor Little Rich Girl',
      'Wild and Woolly',
      'The Immigrant'
   ]
}

Something you've probably noticed from the output above is that some of the years contain the "è" character. This database has some messy values in it. In this case, there's only a small handful of documents, and I think we should just remove them. Add the following two stages to only match documents with a numeric year value, and to sort the results:

stage_match_years = {
   "$match": {
         "year": {
            "$type": "number",
         }
   }
}

stage_sort_year_ascending = {
  "$sort": {"_id": ASCENDING}
}

pipeline = [
   stage_match_years,         # Match numeric years
   stage_group_year,
   stage_sort_year_ascending, # Sort by year
]

Note that the $match stage is added to the start of the pipeline, and the $sort is added to the end. A general rule is that you should filter documents out early in your pipeline, so that later stages have fewer documents to deal with. It also ensures that the pipeline is more likely to be able to take advantages of any appropriate indexes assigned to the collection.

Summary

You've learned how to construct aggregation pipelines to filter, group, and join documents with other collections. You've hopefully learned that putting a $limit stage at the start of your pipeline can be useful to speed up development (but should be removed before going to production). You've also learned some basic optimization tips, like putting filtering expressions towards the start of your pipeline instead of towards the end.

As you've gone through, you'll probably have noticed that there's a ton of different stage types, operators, and accumulator operators. Learning how to use the different components of aggregation pipelines is a big part of learning to use MongoDB effectively as a developer.

END OF THE LECTURE

Last updated

Change request #338: