Aggregating Data

Aggregation Pipeline

The aggregation pipeline gives you a way to transform and combine documents in your collection. You do it by passing the documents through a pipeline that's somewhat analogous to the Unix "pipe", where you send the output of one command to another, then to a third, and so on.

The simplest aggregation, and one you are probably already familiar with, is the SQL GROUP BY expression. We already saw the simple count() method, but what if we want to see how many unicorns are male and how many are female?

db.unicorns.aggregate([
    {$group: {_id: '$gender', total: {$sum: 1}}}
])

In the shell we have the aggregate helper, which takes an array of pipeline operators. For a simple count grouped by something, we only need one such operator: $group. It is the analog of GROUP BY in SQL. We create a new document whose _id field indicates what we are grouping by (here, gender), while the other fields are usually assigned the results of some aggregation; in this case we $sum 1 for each document that matches a particular gender. You probably noticed that the _id field was assigned '$gender' and not 'gender'. The '$' before a field name indicates that the value of that field from the incoming document will be substituted.
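To make the grouping step concrete, here is a rough plain-JavaScript sketch of what that $group stage computes. It is only an illustration: the sample documents are made up, and groupCount is a hypothetical helper, not part of MongoDB.

```javascript
// Made-up sample documents (an assumption, not the book's data set).
const sampleUnicorns = [
  { name: 'Alpha', gender: 'm' },
  { name: 'Beta', gender: 'f' },
  { name: 'Gamma', gender: 'm' },
];

// groupCount is a hypothetical helper that mimics
// {$group: {_id: '$<field>', total: {$sum: 1}}} for an in-memory array.
function groupCount(docs, field) {
  const totals = new Map();
  for (const doc of docs) {
    const key = doc[field];                       // what '$gender' substitutes
    totals.set(key, (totals.get(key) || 0) + 1);  // what {$sum: 1} adds up
  }
  // One output document per distinct key, shaped like $group output.
  return Array.from(totals, ([key, total]) => ({ _id: key, total }));
}

const genderCounts = groupCount(sampleUnicorns, 'gender');
console.log(genderCounts);
```

The sketch returns groups in first-seen order; the real $group stage makes no promise about output order, which is one reason a $sort stage often follows it.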

What other pipeline operators can we use? The one most commonly used before (and frequently after) $group is $match. It takes the same kind of query as the find method, and it lets us aggregate only a matching subset of our documents or exclude some documents from our result.

db.unicorns.aggregate([
    {$match: {weight: {$lt: 600}}},
    {$group: {_id: '$gender', total: {$sum: 1}, avgVamp: {$avg: '$vampires'}}},
    {$sort: {avgVamp: -1}}
])

Here we introduced another pipeline operator, $sort, which does exactly what you would expect. Along with it we also get $skip and $limit. We also used $avg, an operator available inside $group.

MongoDB arrays are powerful, and they don't stop us from aggregating on the values stored inside them. We do need to "flatten" them first, using $unwind, to properly count everything:

db.unicorns.aggregate([
    {$unwind: '$loves'},
    {$group: {_id: '$loves', total: {$sum: 1}, unicorns: {$addToSet: '$name'}}},
    {$sort: {total: -1}},
    {$limit: 1}
])

Here we find out which food item is loved by the most unicorns, and we also get the list of names of all the unicorns that love it. Combining $sort and $limit lets you answer "top N" types of questions.

There is another powerful pipeline operator called $project (analogous to the projection we can specify to find) which allows you not just to include certain fields, but to create or calculate new fields based on values in existing fields. For example, you can use math operators to add together values of several fields before finding out the average, or you can use string operators to create a new field that's a concatenation of some existing fields.
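As a rough sketch of both ideas, the two pipelines below use $project with the $concat and $add expression operators. The new field names (label, combined, avgCombined) and the variable names are made up for illustration, and the choice to add weight and vampires together is just an assumption to show the shape of the stage.

```javascript
// String operator: build a new field by concatenating existing ones.
const labelPipeline = [
  { $project: {
      _id: 0,
      label: { $concat: ['$name', ' (', '$gender', ')'] },
  } },
];

// Math operator: add two fields together, then average the sum per gender.
const avgPipeline = [
  { $project: {
      gender: 1,
      combined: { $add: ['$weight', '$vampires'] },
  } },
  { $group: { _id: '$gender', avgCombined: { $avg: '$combined' } } },
];

// Only runs inside the mongo shell, where `db` and `printjson` exist.
if (typeof db !== 'undefined') {
  db.unicorns.aggregate(labelPipeline).forEach(printjson);
  db.unicorns.aggregate(avgPipeline).forEach(printjson);
}
```

Because a pipeline is just an array of plain objects, it can be built up and inspected as ordinary data before it is ever passed to aggregate.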

This just barely scratches the surface of what you can do with aggregations. In MongoDB 2.6, aggregation got more powerful: the aggregate command can either return a cursor to the result set (which you already know how to work with from Chapter 1) or write your results into a new collection using the $out pipeline operator. You can find many more examples, as well as all of the supported pipeline and expression operators, in the MongoDB manual.

MapReduce

MapReduce is a two-step approach to data processing: first you map, and then you reduce. The mapping step transforms the input documents and emits key/value pairs (the key and/or value can be complex). The emitted pairs are then grouped by key, so that the values for the same key end up in an array. The reduce step gets a key and the array of values emitted for that key, and produces the final result. The map and reduce functions are written in JavaScript.
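The flow described above (map, group by key, reduce) can be sketched in plain JavaScript without a database. This is a simplified model for illustration only: the sample documents and every function name here (mapDoc, groupByKey, sumValues, runMapReduce) are hypothetical, and it does not reproduce how MongoDB actually schedules the work.

```javascript
// Made-up input documents (an assumption, not the book's data set).
const docs = [
  { name: 'Alpha', gender: 'm', vampires: 10 },
  { name: 'Beta', gender: 'f', vampires: 4 },
  { name: 'Gamma', gender: 'm', vampires: 5 },
];

// Step 1: map. Each document emits a key/value pair.
function mapDoc(doc, emit) {
  emit(doc.gender, doc.vampires);
}

// Between the steps: collect emitted values into one array per key.
function groupByKey(documents, mapFn) {
  const groups = new Map();
  for (const doc of documents) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  return groups;
}

// Step 2: reduce. Gets a key and all values emitted for it.
function sumValues(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

// Ties the steps together and returns one result per key.
function runMapReduce(documents, mapFn, reduceFn) {
  const results = [];
  for (const [key, values] of groupByKey(documents, mapFn)) {
    results.push({ _id: key, value: reduceFn(key, values) });
  }
  return results;
}

const vampireTotals = runMapReduce(docs, mapDoc, sumValues);
console.log(vampireTotals);
```

Here the reduce function is called exactly once per key, which keeps the sketch short.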

With MongoDB we use the mapReduce command on a collection. mapReduce takes a map function, a reduce function and an output directive. In the shell we can create and pass JavaScript functions directly. From most libraries you supply your functions as strings (which is a bit ugly). The third parameter sets additional options; for example, we could filter, sort and limit the documents that we want analyzed. We can also supply a finalize method to be applied to the results after the reduce step.
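A minimal sketch of such a call from the shell might look like the following. It counts unicorns per gender, assuming the unicorns collection used earlier in this chapter. The function and variable names are made up, the weight filter is borrowed from the $match example above, and the inline output option is just one of the possible output directives.

```javascript
// map: inside MongoDB, `this` is the current document and `emit` is provided.
const mapGender = function () {
  emit(this.gender, 1);
};

// reduce: adds up the values emitted for one key.
const reduceCount = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
};

// finalize: applied to each reduced value after the reduce step.
const finalizeCount = function (key, reducedValue) {
  return { count: reducedValue };
};

// Third parameter: the output directive plus optional filtering and finalize.
const mapReduceOptions = {
  out: { inline: 1 },               // return results instead of writing a collection
  query: { weight: { $lt: 600 } },  // only analyze matching documents
  finalize: finalizeCount,
};

// Only runs inside the mongo shell, where `db` and `printjson` exist.
if (typeof db !== 'undefined') {
  printjson(db.unicorns.mapReduce(mapGender, reduceCount, mapReduceOptions));
}
```

The reduce function returns the same kind of value that map emits (a number), so it still works if MongoDB feeds an earlier reduce result back into it; the reshaping into a document is left to finalize.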

You probably won't need to use MapReduce for most of your aggregations, but if you do, you can read more about it on my blog and in the MongoDB manual.

Summary

In this chapter we covered MongoDB's aggregation capabilities. The aggregation pipeline is relatively simple to write once you understand how it's structured, and it's a powerful way to group data. MapReduce is more complicated to understand, but its capabilities can be as boundless as any code you can write in JavaScript.
