mongodb-aggregate-framework

Understand MongoDB’s aggregation framework

If you are a “regular” SQL or MySQL user, MongoDB’s aggregation framework may be a little obscure.

In this article, I’ll show you the key to understand it.

I’ll suppose you already have MongoDB installed on your system, and that you’re using Mongo’s shell and understand MongoDB’s language (document, collection…).

Definition of the aggregation framework

The definition of the framework, according MongoDB’s documentation:

Aggregations operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.

This means that the aggregation framework may be use like MySQL’s GROUP BY or DISTINCT (but not only).

Understand the framework

The most important thing to understand, and then, to be able to build your own queries,  is how MongoDB read the query.

Here an example of a simple aggregation:

First of all, this aggregate is absolutely useless. It’s the equivalent of a simple find() query:

Or, in MySQL:

But, it’s not the point here. For now, let’s focus on how the aggregation’s query is read by MongoDB.

This aggregate returns all documents where event_id equal 11. But not only.

(Actually, you certainly noticed that my MySQL transcript of this query is not totally accurate. Indeed, as we’ll see later, this aggregate returns all the document field except _id field. In MySQL, it’s not very simple to do that.)

Explanation

Let’s go back to our aggregate.

The first block is a $match

This is the equivalent of a WHERE in MySQL. The this block will return all documents where event_id11. The aggregate could stop here and be valid; but then, it wouldn’t be very interesting.

The second block is much more interesting and usefull:

The $unwind must be use on arrays, as it unfold them. But it will unfold only the document matching precedent blocks (in this case, the $match). And this is the “key” to understand aggregates. It’s that each block pass its result to the next one.

Then, in our example, the next block is a $project where we set which field from the document we keep or remove, but only for the documents sent by precedent blocks: $match and $unwind.

It’s very important to understand that way of working. You have to think of what you’ll pass on the next block. It’s like a waterfall, if you block or derive half of the content, you’ll be able to work at the end only to the content you let through. To help you understand why an aggregate does not work as it should, you can run your aggregate step-by-step, adding one block after another and see which one break your query.

This is why you may chain a $group after a $match, then another $group.

Links

Published by

Constantin Guay

Currently Data Project Manager and Scrum Evangelist at NetMediaEurope, european leader in IT B2B news site (more than 12 million unique visitors per month). Passionate by user experience and data science. Scrum evangelist

Leave a Reply

Your email address will not be published. Required fields are marked *