Executing MapReduce in Mongo using PyMongo

In our previous recipe, Implementing aggregation in Mongo using PyMongo, we saw how to execute aggregation operations in Mongo using PyMongo. In this recipe, we will work on the same use case as we did for the aggregation operation but we will use MapReduce. The intent is to aggregate the data based on the state names and get the top five state names by the number of documents that they appear in.

Programming language drivers provide us with an interface to invoke the map reduce jobs written in JavaScript on the server.

Getting ready

To execute the map reduce operations, we need to have a server up and running. A simple single node is what we need. Refer to the Installing single node MongoDB recipe from Chapter 1, Installing and Starting the Server for instructions on how to start the server. The data that we will operate on needs to be imported in the database. The steps to import the data are mentioned in the Creating test data recipe in Chapter 2, Command-line Operations and Indexes. Additionally, refer to the Connecting to a single node using Python client recipe in Chapter 1, Installing and Starting the Server on how to install PyMongo for your host operating system.

How to do it…

  1. Open the Python terminal by typing the following on the command prompt:
    $ python
    
  2. Once the Python shell opens, import the bson package as follows:
    >>> import bson
    
  3. Import the pymongo package as follows:
    >>> import pymongo
    
  4. Create an instance of MongoClient as follows:
    >>> client = pymongo.MongoClient('mongodb://localhost:27017')
    
  5. Get the test database's object as follows:
    >>> db = client.test
    
  6. Write the following mapper function:
    >>> mapper = bson.Code('''function() {emit(this.state, 1)}''')
    
  7. Write the following reducer function:
    >>> reducer = bson.Code('''function(key, values){return Array.sum(values)}''')
    
  8. Invoke map reduce; the result will be sent to the pymr_out collection:
    >>> db.postalCodes.map_reduce(map=mapper, reduce=reducer, out='pymr_out')
    
  9. Verify the result as follows:
    >>> c = db.pymr_out.find(sort=[('value', pymongo.DESCENDING)], limit=5)
    >>> for elem in c:
    ...     print elem
    ...
    {u'_id': u'Maharashtra', u'value': 6446.0}
    {u'_id': u'Kerala', u'value': 4684.0}
    {u'_id': u'Tamil Nadu', u'value': 3784.0}
    {u'_id': u'Andhra Pradesh', u'value': 3550.0}
    {u'_id': u'Karnataka', u'value': 3204.0}
    >>>
    

How it works…

Apart from the regular import for pymongo, here we import the bson package as well. This is where the Code class lives; it is the Python object that we use to wrap the JavaScript map and reduce functions. It is instantiated by passing the JavaScript function's source, as a string, to the constructor.

Once we have two Code instances, one for the map function and one for the reduce function, all that remains is to invoke the map_reduce function on the collection. Here we pass three parameters: the two Code instances, using the parameter names map and reduce respectively, and a string, using the parameter name out, that names the output collection to which the results are written.
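If you are on PyMongo 4.0 or later, Collection.map_reduce has been removed, and the usual route is to run the mapReduce command through db.command(). MongoDB itself deprecates mapReduce from version 5.0 in favor of the aggregation pipeline. The following is a minimal sketch of that route, assuming Python 3.7 or later; build_map_reduce_command is a hypothetical helper introduced for illustration and is not part of PyMongo.

```python
# JavaScript sources from the recipe, kept here as plain strings.
MAPPER_JS = "function() {emit(this.state, 1)}"
REDUCER_JS = "function(key, values){return Array.sum(values)}"


def build_map_reduce_command(collection, map_js, reduce_js, out):
    """Build a mapReduce command document (hypothetical helper).

    The command name must be the first key; plain dicts keep insertion
    order in Python 3.7+, which this sketch relies on.
    """
    return {
        "mapReduce": collection,
        "map": map_js,
        "reduce": reduce_js,
        "out": out,
    }


command = build_map_reduce_command("postalCodes", MAPPER_JS, REDUCER_JS, "pymr_out")
print(command)

# With a running server and PyMongo installed, the command could be sent as:
#   import pymongo
#   db = pymongo.MongoClient('mongodb://localhost:27017').test
#   db.command(command)
```

The three arguments mirror the map, reduce, and out parameters passed to map_reduce in the recipe, so the results would still land in the pymr_out collection.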

We won't explain the map and reduce JavaScript functions in detail here, as they are quite simple. The map function emits, for each document, the state's name as the key and 1 as the value; the reduce function sums the values emitted for each key, which gives the number of times that state's name occurs. Each result document has the state's name in the _id field and the count in a field called value, and is added to the output collection, pymr_out. For example, the state Maharashtra appears 6446 times in the entire collection, so the document for Maharashtra is {u'_id': u'Maharashtra', u'value': 6446.0}. To verify that the result is correct, you can execute the following query in the mongo shell and see that the result is indeed 6446:

> db.postalCodes.count({state:'Maharashtra'})
6446
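To make the map and reduce steps described above concrete, here is a small pure-Python sketch that mimics them without a server. The sample documents are hypothetical and far smaller than the real postalCodes collection, and the function names are made up for illustration. It is also a simplification: the server may call reduce several times for the same key, and skips it for keys that have a single value.

```python
from collections import defaultdict

# Hypothetical sample documents; only the state field matters here.
sample_docs = [
    {"state": "Maharashtra"},
    {"state": "Kerala"},
    {"state": "Maharashtra"},
    {"state": "Tamil Nadu"},
    {"state": "Kerala"},
    {"state": "Maharashtra"},
]


def map_phase(docs):
    """Mimic the mapper: emit (state, 1) for every document."""
    for doc in docs:
        yield doc["state"], 1


def reduce_phase(key, values):
    """Mimic the reducer: the equivalent of Array.sum(values)."""
    return sum(values)


def simulate_map_reduce(docs):
    # Group the emitted values by key, as happens between map and reduce.
    grouped = defaultdict(list)
    for key, value in map_phase(docs):
        grouped[key].append(value)
    # Build one output document per key, shaped like those in pymr_out.
    return [
        {"_id": key, "value": reduce_phase(key, values)}
        for key, values in grouped.items()
    ]


results = simulate_map_reduce(sample_docs)
for doc in results:
    print(doc)
```

Each printed dictionary has the same _id and value layout as the documents written to pymr_out, except that the counts here are Python integers rather than the floating-point values returned by the server.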

We are not done yet, as the requirement is to find the top five states by their number of occurrences in the collection. At this point we have only the states and their counts, so the final step is to sort the documents in descending order by the value field, which holds the number of times the state's name occurred, and to limit the result to five documents.
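The sort and limit step can be pictured in plain Python as follows. This is only a sketch of what find(sort=..., limit=5) achieves on the server side: the five real counts are copied from the output shown in the recipe, while the two extra entries and the top_n function are hypothetical, added so that the limit has something to cut off.

```python
# Counts for the five named states come from the recipe's output;
# the two "Hypothetical State" entries are made-up placeholders.
reduced = [
    {"_id": "Karnataka", "value": 3204.0},
    {"_id": "Hypothetical State A", "value": 10.0},
    {"_id": "Maharashtra", "value": 6446.0},
    {"_id": "Tamil Nadu", "value": 3784.0},
    {"_id": "Hypothetical State B", "value": 25.0},
    {"_id": "Kerala", "value": 4684.0},
    {"_id": "Andhra Pradesh", "value": 3550.0},
]


def top_n(docs, n=5):
    """Sort by the value field in descending order and keep the first n."""
    return sorted(docs, key=lambda d: d["value"], reverse=True)[:n]


top_five = top_n(reduced)
for doc in top_five:
    print(doc)
```

In the recipe itself the server does this work, so only five documents are returned to the client; sorting in Python would require fetching every document from pymr_out first.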

See also

Refer to Chapter 8, Integration with Hadoop for different recipes on executing map reduce jobs in MongoDB using the Hadoop connector. This allows us to write the map and reduce functions in languages such as Java, Python, and so on.
