}

Introduction to MapReduce with MongoDB

Created:

Introduction

MapReduce is a programming model and an associated implementation for processing and generating large data sets. To use MapReduce the user need to define a map function which takes a key/value pair and produces an intermediate key/value pair, later a reduce function merges the intermediate results of the same key to produce the final result.

Google published on 2004 a very important paper about MapReduce, which made MapReduce very popular today.

When you use MapReduce to compute you can exploit a large cluster of computers. This allows you to create parallel and distribuited systems with simplicity. Using MapReduce allows you to process many terabytes of data and to create summaries, inverted search, graph representations

In this tutorial we are going to use python and mongoengine and pymongo to explian how to use MapReduce with MongoDB.

Using MapReduce with MongoDB

First we need to understand how our objects are made, as an example check the following json:

{
    _id: 1, 
    url: "www.example.com",
    category: "Website type 1",
    words:[
        "example",
        "site",
        "good"
    ]
},
{
    _id: 2, 
    url: "test.com",
    category: "Website type 1",
    words:[
        "awesome",
        "life",
        "good"
    ]
}

We want to give a score to each website using the words. Each word will have a score and we want for each document the sum of all words.

This is how the data will be converted during MapReduce. First the map function will select the _id and words fields:

{
category: "Website type 1",
scores: [101, 1]}

The result will containt a list of score since out date contain two sites with the category "Website type 1". The map function will sum 101 for the first value, since example score is 100, site score is 1 and good score is 0. For the second value "life" has 1 as score.

Then the reduce function will output:

{
category: "Website type 1",
scores: 102}

102 is the total sum of all values for the "Website type 1" category.

Let's see the map function and reduce function code.

The Map Function

For MongoDB we are going to write javascript code. We need out map function to return the score of each document. For this we are going to use "computeScores" function which uses words to calculate the score.

var map = function () { 
    emit(this.category, { scores: computeScores(this.words) });
}

You can think that the first parameter of the emit function is the field to "group by" the data.

The Reduce Funtion

The reduce function is very simple. It will receive for each key of the intermediate result an array of values and we need to sum those values.

var reduce = function(key,values)
 {
     var reduced = {        
         score:0
     }

     for (var i=0; i < values.length;i++)
     {
        reduced.score+=values[i];        
     }   

     return reduced;
 }

MapReduce query on MongoDB shell examples

Here is an example of how to use MapReduce from the mongo shell:

map = function () { 
    emit(this.category, { scores: computeScores(this.words) });
}
reduce = function(key,values)
 {
     var reduced = {        
         score:0
     }

     for (var i=0; i < values.length;i++)
     {
        reduced.score+=values[i];        
     }   

     return reduced;
 }
db.collection.mapReduce(
   map,
   reduce,
   {
      out: collection,
   }
)

The first parameter is the map function, the second is the reduce function. Then you need to specify out which is the location of the result.

Optional parameters: * query: If you want to filter documents the will be used for MapReduce. * sort: You can specify to the order of the result. * limit: Maximum number of results.

We are going to use python pymongo and mongoengine to do a MapReduce query, but you should be familiar with the mongodb query to understand the parameters it requires.

Implementation with pymongo

When using pymonho we need to import Code from bson. We will define the functions as instances of Code. Then we will call to map_reduce of yourcollection with the three required parameters. Results can be read with an iterator.

from bson.code import Code
map = Code("var map = function () { "
"    emit(this.category, { scores: computeScores(this.words) });"
"}"
)

result = db.yourcollection.map_reduce(map, reduce, "myresults")
for doc in result.find():
    pass

Implementation with mongoengine

Check the following mapredice mongoengine example:

map_f = """
    # put here javascript code
"""

reduce_f = """
    #put here javascript code
"""

for i in model.objects.map_reduce(map_f, reduce_f, "NewCollection"):
    pass

map_f and reduce_f corresponds to the javascript code of the map and reduce functions. If you want to merge into an already existing collection use {"merge":"COLLECTION_NAME"} instead of "NewCollection".