Querying MongoDB Like an SQL DB Using Aggregation Pipeline

What Are Aggregations?

Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result.

In the db.collection.aggregate method and db.aggregate method, pipeline stages appear in an array. Documents pass through the stages in sequence. We will go through some of the stages to get results like those of a relational database.


$match (WHERE)

Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage.

It has the following prototype:

{ $match: { <query> } }

It is the equivalent of WHERE in SQL queries. Let us take an example to make things clear. This example uses a collection named articles with the following documents:

{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b258"), "author" : "li", "score" : 55, "views" : 5000 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 }

Equality match.

db.articles.aggregate(
    [ { $match : { author : "dave" } } ]
);
// Result
{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }

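We can combine multiple conditions inside $match with operators such as $or and $and, according to our requirements. $match has some limitations as well: it uses the regular query syntax, so aggregation expressions must be wrapped in $expr, and the $where operator is not allowed. You can read more about them in the MongoDB docs.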
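
As a sketch of combining conditions, the pipeline below keeps the articles whose score is strictly between 70 and 90, or whose views are at least 1000. The SQL comment and the plain-JavaScript predicate are illustrations added here, not part of the original example; the predicate only mirrors the filter on a trimmed copy of the sample documents so the snippet also runs outside mongosh.

```javascript
// SQL: SELECT * FROM articles
//      WHERE (score > 70 AND score < 90) OR views >= 1000;
const pipeline = [
  {
    $match: {
      $or: [
        { score: { $gt: 70, $lt: 90 } }, // two operators on one field are ANDed
        { views: { $gte: 1000 } },
      ],
    },
  },
];

// Trimmed copy of the sample documents (the _id field is omitted for brevity).
const articles = [
  { author: "dave", score: 80, views: 100 },
  { author: "dave", score: 85, views: 521 },
  { author: "ahn", score: 60, views: 1000 },
  { author: "li", score: 55, views: 5000 },
  { author: "annT", score: 60, views: 50 },
  { author: "li", score: 94, views: 999 },
  { author: "ty", score: 95, views: 1000 },
];

// Plain-JavaScript predicate that mirrors the $match filter above.
const matches = (doc) =>
  (doc.score > 70 && doc.score < 90) || doc.views >= 1000;
const matchedAuthors = articles.filter(matches).map((doc) => doc.author);
console.log(matchedAuthors); // authors of the five matching documents

// In mongosh, where `db` is defined, run the real pipeline.
if (typeof db !== "undefined") {
  printjson(db.articles.aggregate(pipeline).toArray());
}
```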


$skip (OFFSET)

Skips over the specified number of documents that pass into the stage and passes the remaining documents to the next stage in the pipeline. It has the following prototype:

{ $skip: <positive integer> }

In the above example, we matched the records related to dave. If we want to skip the first few results, we would write the query as:

db.articles.aggregate([ 
    { $match : { author : "dave" } },
    { $skip: 1 } 
]);
// We are skipping 1 result and we should get just this
{ "_id" : ObjectId("5dc1d22f24a8e913bfcf4f60"), "author" : "dave", "score" : 85, "views" : 521 }

We generally use $skip with $limit to paginate the data. Let’s insert a few more records into our collection and see how $skip and $limit work together in the next section.


$limit (LIMIT)

Limits the number of documents passed to the next stage in the pipeline. It has the following prototype:

{ $limit: <positive integer> }

We have added new records for dave. Let’s see how the collection looks with a simple $match:

db.articles.aggregate(
    [ { $match : { author : "dave" } } ]
);
{ "_id" : ObjectId("5dc1d22124a8e913bfcf4f5f"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("5dc1d22f24a8e913bfcf4f60"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("5dc1d53924a8e913bfcf4f65"), "author" : "dave", "score" : 185, "views" : 1521 }
{ "_id" : ObjectId("5dc1d54f24a8e913bfcf4f66"), "author" : "dave", "score" : 15, "views" : 21 }

We have four matching records in the example collection. Suppose you are asked to paginate the results to show two at a time. How would you go about it? Let’s see.

db.articles.aggregate([ 
    { $match : { author : "dave" } },
    { $skip: 0},
    { $limit: 2}
]);
// We are not skipping any records, but we are limiting the result to 2 records
{ "_id" : ObjectId("5dc1d22124a8e913bfcf4f5f"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("5dc1d22f24a8e913bfcf4f60"), "author" : "dave", "score" : 85, "views" : 521 }
// We got the first two results; to get the next two, just update the $skip
db.articles.aggregate([ 
    { $match : { author : "dave" } },
    { $skip: 2},
    { $limit: 2}
]);
// This should give two records after skipping the first two.
{ "_id" : ObjectId("5dc1d53924a8e913bfcf4f65"), "author" : "dave", "score" : 185, "views" : 1521 }
{ "_id" : ObjectId("5dc1d54f24a8e913bfcf4f66"), "author" : "dave", "score" : 15, "views" : 21 }
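
The skip value for any page follows from the page number, so the two stages can be generated instead of written by hand. The helper below is a hypothetical sketch (pageStages is not a MongoDB function) and assumes pages are numbered from 1.

```javascript
// Hypothetical helper: builds the $skip and $limit stages for a 1-based page.
function pageStages(page, pageSize) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be an integer >= 1");
  }
  return [{ $skip: (page - 1) * pageSize }, { $limit: pageSize }];
}

// Page 2 with two results per page skips the first two matches.
const pageTwo = [{ $match: { author: "dave" } }, ...pageStages(2, 2)];
console.log(JSON.stringify(pageTwo));

// In mongosh, where `db` is defined, run the real pipeline.
if (typeof db !== "undefined") {
  printjson(db.articles.aggregate(pageTwo).toArray());
}
```

Without an explicit sort, MongoDB does not guarantee the order of the matched documents, so in practice the documents should be sorted before they are paginated.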

We are doing well so far, but suppose your manager asks you to sort the results by views. What are you going to do? We have $sort for that.


$sort (ORDER BY)

Sorts all input documents and returns them to the pipeline in sorted order. It has the following prototype:

{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }

For the sort order, use 1 to specify ascending order and -1 to specify descending order. Let us use the $sort stage in our pipeline:

db.articles.aggregate([ 
    { $match : { author : "dave" } },
    { $sort: { views: 1}}
]);
// Result
{ "_id" : ObjectId("5dc1d54f24a8e913bfcf4f66"), "author" : "dave", "score" : 15, "views" : 21 }
{ "_id" : ObjectId("5dc1d22124a8e913bfcf4f5f"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("5dc1d22f24a8e913bfcf4f60"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("5dc1d53924a8e913bfcf4f65"), "author" : "dave", "score" : 185, "views" : 1521 }

Voilà! The results are sorted now.

Place the $match as early in the aggregation pipeline as possible. Because $match limits the total number of documents in the aggregation pipeline, earlier $match operations minimize the amount of processing down the pipe.
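
Putting the stages seen so far together, a sorted and paginated query might look like the sketch below. The function name topArticlesPipeline and the _id tie-breaker are choices made for this illustration; the tie-breaker only keeps the order stable when two articles have the same number of views.

```javascript
// SQL: SELECT * FROM articles WHERE author = ?
//      ORDER BY views DESC, _id ASC LIMIT ? OFFSET ?;
function topArticlesPipeline(author, page, pageSize) {
  return [
    { $match: { author } },           // filter first, so later stages see fewer documents
    { $sort: { views: -1, _id: 1 } }, // then order the remaining documents
    { $skip: (page - 1) * pageSize }, // then paginate (pages are 1-based)
    { $limit: pageSize },
  ];
}

const firstPage = topArticlesPipeline("dave", 1, 2);
// Prints: $match -> $sort -> $skip -> $limit
console.log(firstPage.map((stage) => Object.keys(stage)[0]).join(" -> "));

// In mongosh, where `db` is defined, run the real pipeline.
if (typeof db !== "undefined") {
  printjson(db.articles.aggregate(firstPage).toArray());
}
```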


$group (GROUP BY)

Groups input documents by the specified _id expression and, for each distinct grouping, outputs a document.

The _id field of each output document contains the unique group by value. The output documents can also contain computed fields that hold the values of an accumulator expression. It has the following prototype:

{
  $group:
    {
      _id: <expression>, // Group By Expression
      <field1>: { <accumulator1> : <expression1> },
      ...
    }
}

Suppose we want to group the articles by author, in other words, to count the number of articles by each author. We can make use of the $group stage in the pipeline. So, let’s see it live:

db.articles.aggregate([ 
    { $group : { _id: "$author", count: { $sum: 1 }}},
    { $sort: { count: 1 }}
]);
// We have grouped the articles by author, getting the count and sorting by it
{ "_id" : "annT", "count" : 1 }
{ "_id" : "ahn", "count" : 1 }
{ "_id" : "ty", "count" : 1 }
{ "_id" : "li", "count" : 2 }
{ "_id" : "dave", "count" : 4 }
// To keep only the authors whose count is greater than 1 (HAVING in SQL), add a $match stage after $group.
// Filtering before sorting leaves fewer documents to sort.
db.articles.aggregate([ 
    { $group : { _id: "$author", count: { $sum: 1 }}},
    { $match: { count : { $gt: 1 }}},
    { $sort: { count: 1 }}
]);
{ "_id" : "li", "count" : 2 }
{ "_id" : "dave", "count" : 4 }
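
$sum is not the only accumulator. As a sketch, the pipeline below also totals the views and averages the score per author, with the rough SQL counterpart in the comment. The output field names (articles, totalViews, avgScore) are made up for this example.

```javascript
// SQL: SELECT author, COUNT(*) AS articles, SUM(views) AS totalViews,
//             AVG(score) AS avgScore
//      FROM articles
//      GROUP BY author
//      HAVING COUNT(*) > 1
//      ORDER BY totalViews DESC;
const authorStats = [
  {
    $group: {
      _id: "$author",
      articles: { $sum: 1 },          // counts documents in the group
      totalViews: { $sum: "$views" }, // adds up a field
      avgScore: { $avg: "$score" },   // averages a field
    },
  },
  { $match: { articles: { $gt: 1 } } }, // plays the role of HAVING
  { $sort: { totalViews: -1 } },
];
console.log(JSON.stringify(authorStats, null, 2));

// In mongosh, where `db` is defined, run the real pipeline.
if (typeof db !== "undefined") {
  printjson(db.articles.aggregate(authorStats).toArray());
}
```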

Let us take this grouping up a notch. Suppose we want to group by values stored in an array structure. We have something called $unwind. Let’s see how it works.


$unwind

Deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.

You can pass the array field path to $unwind. When using this syntax, $unwind does not output a document if the field value is null, missing, or an empty array. It has the following prototype:

{ $unwind: <field path> }

Let us take a new collection inventory and add a new record to it with the following command:

db.inventory.insertOne({ "_id" : 1, "item" : "ABC1", sizes: [ "S", "M", "L"] })

That’s the beauty of MongoDB: you can create a new collection and add a record to it without any setup.

Let us $unwind this by the sizes.

db.inventory.aggregate( [ { $unwind : "$sizes" } ] )
// Result
{ "_id" : 1, "item" : "ABC1", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "L" }

Each document is identical to the input document except for the value of the sizes field which now holds a value from the original sizes array.

Let us take a new collection inventory2 and group by size. Use this command to insert more records:

db.inventory2.insertMany([
  { "_id" : 1, "item" : "ABC", price: NumberDecimal("80"), "sizes": [ "S", "M", "L"] },
  { "_id" : 2, "item" : "EFG", price: NumberDecimal("120"), "sizes" : [ ] },
  { "_id" : 3, "item" : "IJK", price: NumberDecimal("160"), "sizes": "M" },
  { "_id" : 4, "item" : "LMN" , price: NumberDecimal("10") },
  { "_id" : 5, "item" : "XYZ", price: NumberDecimal("5.75"), "sizes" : null }
])

If we unwind this, we get the following:

db.inventory2.aggregate( [ { $unwind: "$sizes" } ] )
// Results
{ "_id" : 1, "item" : "ABC", "price" : NumberDecimal("80"), "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "price" : NumberDecimal("80"), "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "price" : NumberDecimal("80"), "sizes" : "L" }
{ "_id" : 3, "item" : "IJK", "price" : NumberDecimal("160"), "sizes" : "M" }
// Notice that it skips the documents whose sizes field is null, missing, or an empty array
// Now unwind and group by size
db.inventory2.aggregate([ 
    { $unwind: { path: "$sizes" } },
    { $group: { _id: "$sizes", count: { $sum: 1 }}}
]);
// Results
{ "_id" : "M", "count" : 2 }
{ "_id" : "L", "count" : 1 }
{ "_id" : "S", "count" : 1 }

We can apply different stages to this like $match, $sort, $skip, $limit, etc. to get the desired results.
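
$unwind also accepts a document form with a preserveNullAndEmptyArrays option, which keeps the documents whose array is null, missing, or empty instead of dropping them. The sketch below shows both forms. The plain-JavaScript unwind function is only a rough in-memory imitation written for this illustration (top-level fields only), not MongoDB's implementation, and the sample documents omit the price field for brevity.

```javascript
const dropEmpty = [{ $unwind: "$sizes" }];
const keepEmpty = [
  { $unwind: { path: "$sizes", preserveNullAndEmptyArrays: true } },
];

// Rough in-memory imitation of $unwind on a top-level field.
function unwind(docs, field, preserve = false) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    if (value === undefined || value === null) {
      return preserve ? [doc] : []; // null or missing: kept only when preserving
    }
    if (!Array.isArray(value)) {
      return [doc]; // a non-array value is treated as a single element
    }
    if (value.length === 0) {
      if (!preserve) return [];
      const { [field]: _dropped, ...rest } = doc; // empty array: field is omitted
      return [rest];
    }
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

const inventory2 = [
  { _id: 1, item: "ABC", sizes: ["S", "M", "L"] },
  { _id: 2, item: "EFG", sizes: [] },
  { _id: 3, item: "IJK", sizes: "M" },
  { _id: 4, item: "LMN" },
  { _id: 5, item: "XYZ", sizes: null },
];
console.log(unwind(inventory2, "sizes").length);       // 4
console.log(unwind(inventory2, "sizes", true).length); // 7

// In mongosh, where `db` is defined, run the real pipelines.
if (typeof db !== "undefined") {
  printjson(db.inventory2.aggregate(dropEmpty).toArray());
  printjson(db.inventory2.aggregate(keepEmpty).toArray());
}
```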

Now, let’s move on to SQL joins. To achieve joins in MongoDB, we have $lookup.


$lookup (LEFT OUTER JOIN)

New in version 3.2.

Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing.

To each input document, the $lookup stage adds a new array field whose elements are the matching documents from the “joined” collection. The $lookup stage passes these reshaped documents to the next stage.

There can be different join conditions, but the example in this section uses the most basic one: an equality match between a field of the input documents (localField) and a field of the joined collection (foreignField).

$lookup also has a more general form. To perform uncorrelated subqueries between two collections, or to use join conditions other than a single equality match, the $lookup stage has the following syntax:

{
   $lookup:
     {
       from: <collection to join>,
       let: { <var_1>: <expression>, …, <var_n>: <expression> },
       pipeline: [ <pipeline to execute on the collection to join> ],
       as: <output array field>
     }
}

from: Specifies the collection in the same database to perform the join with.
let: Optional. Specifies variables to use in the pipeline field stages. Use the variable expressions to access the fields from the documents input to the $lookup stage.
pipeline: Specifies the pipeline to run on the joined collection. The pipeline determines the resulting documents from the joined collection. To return all documents, specify an empty pipeline [].
as: Specifies the name of the new array field to add to the input documents. The new array field contains the matching documents from the from collection. If the specified name already exists in the input document, the existing field is overwritten.

Let us look at an example to see $lookup in action:

Perform a single equality join with $lookup

Create a collection orders with the following documents:

db.orders.insertMany([
   { "_id" : 1, "item" : "almonds", "price" : 12, "quantity" : 2 },
   { "_id" : 2, "item" : "pecans", "price" : 20, "quantity" : 1 },
   { "_id" : 3  }
])

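Create another collection inventory with the following documents. If the inventory collection from the $unwind example still exists, drop it first with db.inventory.drop(); otherwise the insert fails because a document with _id 1 is already there.

db.inventory.insertMany([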
   { "_id" : 1, "sku" : "almonds", description: "product 1", "instock" : 120 },
   { "_id" : 2, "sku" : "bread", description: "product 2", "instock" : 80 },
   { "_id" : 3, "sku" : "cashews", description: "product 3", "instock" : 60 },
   { "_id" : 4, "sku" : "pecans", description: "product 4", "instock" : 70 },
   { "_id" : 5, "sku": null, description: "Incomplete" },
   { "_id" : 6 }
])

The following aggregation operation on the orders collection joins the documents from orders with the documents from the inventory collection, using the item field from the orders collection and the sku field from the inventory collection:

db.orders.aggregate([
   {
     $lookup:
       {
         from: "inventory",
         localField: "item",
         foreignField: "sku",
         as: "inventory_docs"
       }
  }
]);

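The operation returns the following documents. Note that the order with _id 3 has no item field, so it is matched with the inventory documents whose sku is null or missing: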

{
   "_id" : 1,
   "item" : "almonds",
   "price" : 12,
   "quantity" : 2,
   "inventory_docs" : [
      { "_id" : 1, "sku" : "almonds", "description" : "product 1", "instock" : 120 }
   ]
}
{
   "_id" : 2,
   "item" : "pecans",
   "price" : 20,
   "quantity" : 1,
   "inventory_docs" : [
      { "_id" : 4, "sku" : "pecans", "description" : "product 4", "instock" : 70 }
   ]
}
{
   "_id" : 3,
   "inventory_docs" : [
      { "_id" : 5, "sku" : null, "description" : "Incomplete" },
      { "_id" : 6 }
   ]
}
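
The pipeline form of $lookup, with let and pipeline, is described above but not used in the example. As a hedged sketch, the query below joins each order to the inventory documents that have the same sku and also enough stock for the ordered quantity. The variable names order_item and order_qty and the output field in_stock_docs are invented for this illustration. $expr is needed inside the inner $match to compare fields with the let variables, and this form assumes MongoDB 3.6 or later.

```javascript
const lookupWithCondition = [
  {
    $lookup: {
      from: "inventory",
      // Expose fields of the current order to the inner pipeline as variables.
      let: { order_item: "$item", order_qty: "$quantity" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$sku", "$$order_item"] },     // the join condition
                { $gte: ["$instock", "$$order_qty"] }, // the extra condition
              ],
            },
          },
        },
        { $project: { _id: 0, sku: 1, instock: 1 } }, // keep only two fields
      ],
      as: "in_stock_docs",
    },
  },
];
console.log(JSON.stringify(lookupWithCondition, null, 2));

// In mongosh, where `db` is defined, run the real pipeline.
if (typeof db !== "undefined") {
  printjson(db.orders.aggregate(lookupWithCondition).toArray());
}
```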

Conclusion

This was just a basic overview of using SQL-like queries in MongoDB.

There is a lot more that can be done using many other stages that are available. The best way to have a good grasp of it is by practicing different scenarios and using them in your projects. I hope this will help.


Resources

MongoDB Documentation