In our previous recipe, Implementing aggregation in Mongo using PyMongo, we saw how to execute aggregation operations in Mongo using PyMongo. In this recipe, we will work on the same use case as we did for the aggregation operation but we will use MapReduce. The intent is to aggregate the data based on the state names and get the top five state names by the number of documents that they appear in.
Programming language drivers provide us with an interface to invoke the map reduce jobs written in JavaScript on the server.
To execute the map reduce operations, we need to have a server up and running. A simple single node is all we need. Refer to the Installing single node MongoDB recipe from Chapter 1, Installing and Starting the Server, for instructions on how to start the server. The data that we will operate on needs to be imported into the database. The steps to import the data are given in the Creating test data recipe in Chapter 2, Command-line Operations and Indexes. Additionally, refer to the Connecting to a single node using Python client recipe in Chapter 1, Installing and Starting the Server, for how to install PyMongo for your host operating system.
Start the Python interpreter from the operating system shell:
$ python
Import the bson package as follows:
>>> import bson
Import the pymongo package as follows:
>>> import pymongo
Create the MongoClient as follows:
>>> client = pymongo.MongoClient('mongodb://localhost:27017')
Get a reference to the test database:
>>> db = client.test
Define the mapper function:
>>> mapper = bson.Code('''function() {emit(this.state, 1)}''')
Define the reducer function:
>>> reducer = bson.Code('''function(key, values){return Array.sum(values)}''')
Invoke map reduce on the postalCodes collection and send the output to the pymr_out collection:
>>> db.postalCodes.map_reduce(map=mapper, reduce=reducer, out='pymr_out')
Query the pymr_out collection for the top five states:
>>> c = db.pymr_out.find(sort=[('value', pymongo.DESCENDING)], limit=5)
>>> for elem in c:
...     print elem
...
{u'_id': u'Maharashtra', u'value': 6446.0}
{u'_id': u'Kerala', u'value': 4684.0}
{u'_id': u'Tamil Nadu', u'value': 3784.0}
{u'_id': u'Andhra Pradesh', u'value': 3550.0}
{u'_id': u'Karnataka', u'value': 3204.0}
>>>
Apart from the regular import for pymongo, here we import the bson package as well. This is where we have the Code class; it is the Python object that we use for the JavaScript map and reduce functions. It is instantiated by passing the JavaScript function body as a constructor argument.
Once two instances of the Code class are instantiated, one for map and the other for reduce, all we do is invoke the map_reduce function on the collection. In this case, we passed three parameters: two Code instances for the map and reduce functions with parameter names map and reduce, respectively, and one string value used to provide the name of the output collection that the results are written to.
We won't explain the map reduce JavaScript functions in detail here, but they are simple: the map function emits the state's name as the key and 1 as the value, and the reduce function sums these values to give the number of times that state name occurs. Each result document, with the state's name as the _id field and the sum in a field called value, is added to the output collection, pymr_out. For example, in the entire collection, the state Maharashtra appeared 6446 times, thus the document for the state of Maharashtra is {u'_id': u'Maharashtra', u'value': 6446.0}. To verify that the result is correct, you can execute the following query in the mongo shell and see that the result is indeed 6446:
> db.postalCodes.count({state:'Maharashtra'})
6446
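To make the emit-and-sum flow concrete, the following is a minimal pure-Python sketch that mimics what the JavaScript map and reduce functions do. It does not talk to MongoDB. The SAMPLE_DOCS list and the helper names mapper, reducer, and map_reduce are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents; the real postalCodes collection is far larger.
SAMPLE_DOCS = [
    {"state": "Maharashtra"},
    {"state": "Kerala"},
    {"state": "Maharashtra"},
    {"state": "Tamil Nadu"},
    {"state": "Maharashtra"},
    {"state": "Kerala"},
]


def mapper(doc):
    """Mimic function() {emit(this.state, 1)}: emit one (key, value) pair."""
    yield doc["state"], 1


def reducer(key, values):
    """Mimic function(key, values) {return Array.sum(values)}."""
    return sum(values)


def map_reduce(docs):
    """Group emitted values by key, reduce each group, build output documents."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return [
        {"_id": key, "value": reducer(key, values)}
        for key, values in grouped.items()
    ]


results = map_reduce(SAMPLE_DOCS)
```

Each entry in results has the same shape as a document in pymr_out: the state's name in _id and its count in value. The sketch produces integer counts, whereas the server returns JavaScript numbers, which is why the recipe's output shows values such as 6446.0.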
We are still not done, as the requirement is to find the top five states by their occurrence in the collection. So far we have just the states and their occurrence counts, so the final step is to sort the documents in descending order by the value field, which holds the number of times the state's name occurred, and limit the result to five documents.
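The sort-and-limit step can also be sketched in plain Python. This example reuses the five documents from the recipe's output, deliberately shuffled, and keeps the top three to show the effect of the limit. The helper top_n is a hypothetical name for illustration; in the recipe itself, the server does this work through the sort and limit arguments to find.

```python
from operator import itemgetter

# The five documents from the recipe's output, in shuffled order.
docs = [
    {"_id": "Tamil Nadu", "value": 3784.0},
    {"_id": "Karnataka", "value": 3204.0},
    {"_id": "Maharashtra", "value": 6446.0},
    {"_id": "Andhra Pradesh", "value": 3550.0},
    {"_id": "Kerala", "value": 4684.0},
]


def top_n(documents, n):
    """Sort by the value field in descending order and keep the first n."""
    ordered = sorted(documents, key=itemgetter("value"), reverse=True)
    return ordered[:n]


top_three = top_n(docs, 3)
for doc in top_three:
    print(doc["_id"], doc["value"])
```

Sorting on the server, as the recipe does, avoids pulling the whole output collection into the client just to discard most of it.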
Refer to Chapter 8, Integration with Hadoop, for different recipes on executing map reduce jobs in MongoDB using the Hadoop connector. This allows us to write the map and reduce functions in languages such as Java, Python, and so on.