This recipe involves running a MapReduce job on the cloud using AWS. You will need an AWS account in order to proceed; register with AWS at http://aws.amazon.com/. We will see how to run a MapReduce job on the cloud using Amazon Elastic MapReduce (Amazon EMR), a managed MapReduce service provided by Amazon on the cloud. Refer to https://aws.amazon.com/elasticmapreduce/ for more details. Amazon EMR consumes data, binaries/JARs, and so on from an AWS S3 bucket, processes them, and writes the results back to an S3 bucket. Amazon Simple Storage Service (Amazon S3) is another AWS service for data storage on the cloud; refer to http://aws.amazon.com/s3/ for more details on Amazon S3.

Though we will use the mongo-hadoop connector, an interesting fact is that we won't require a MongoDB instance to be up and running. We will use a MongoDB data dump stored in an S3 bucket for our data analysis. The MapReduce program will run on the input BSON dump and generate the result BSON dump in the output bucket, while the logs of the MapReduce program will be written to another bucket dedicated to logs. The following figure gives us an idea of how our setup looks at a high level:
We will use the same Java sample as we did in the Writing our first Hadoop MapReduce job recipe. To know more about the mapper and reducer class implementations, you can refer to the How it works section of the same recipe. We have a mongo-hadoop-emr-test project available with the code that can be downloaded from the Packt website, which is used to create a MapReduce job on the cloud using the AWS EMR APIs. To simplify things, we will upload just one JAR to the S3 bucket to execute the MapReduce job. This JAR will be assembled using a BAT file on Windows and a shell script on Unix-based operating systems. The mongo-hadoop-emr-test Java project has the mongo-hadoop-emr-binaries subdirectory containing the necessary binaries along with the scripts to assemble them into one JAR. The assembled mongo-hadoop-emr-assembly.jar file is also provided in the subdirectory; running the .bat or .sh file deletes this JAR and regenerates it, but doing so is not mandatory, as the assembled JAR that is already provided will work just fine. The Java project also contains a data subdirectory with a postalCodes.bson file in it. This is the BSON dump generated from the database containing the postalCodes collection; the mongodump utility provided with the mongo distribution was used to extract this dump.
Create an S3 bucket named com.packtpub.mongo.cookbook.emr-in to hold the input JAR and data. Remember that the name of a bucket has to be unique across all S3 buckets, so you will not be able to create a bucket with this very name; you will have to create one with a different name and use it in place of com.packtpub.mongo.cookbook.emr-in wherever that name is used in this recipe.

Upload the assembled JAR file and the .bson file for the data to the newly created (or existing) S3 bucket. To upload the files, we will use the AWS web console. Click on the Upload button and select the assembled JAR file and the postalCodes.bson file to be uploaded to the S3 bucket. After uploading, the bucket should contain both of these files.
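If you prefer to script the bucket creation and upload rather than use the web console, the same can be done with the AWS SDK for Java. The following is a minimal, hypothetical sketch (it is not part of the recipe's code samples, and it assumes the SDK picks up your credentials from the default provider chain):

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadInputToS3 {
    public static void main(String[] args) {
        // Credentials come from the default provider chain
        // (environment variables, ~/.aws/credentials, and so on).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Use your own, globally unique bucket name here.
        String inputBucket = "com.packtpub.mongo.cookbook.emr-in";
        if (!s3.doesBucketExistV2(inputBucket)) {
            s3.createBucket(inputBucket);
        }

        // Upload the assembled JAR and the BSON data dump; the local paths
        // assume the program is run from the project's root directory.
        s3.putObject(inputBucket, "mongo-hadoop-emr-assembly.jar",
                new File("mongo-hadoop-emr-binaries/mongo-hadoop-emr-assembly.jar"));
        s3.putObject(inputBucket, "postalCodes.bson", new File("data/postalCodes.bson"));
    }
}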
Now click on the Configure and Add button and enter the details. The value of the JAR S3 Location is given as s3://com.packtpub.mongo.cookbook.emr-in/mongo-hadoop-emr-assembly.jar. This is the location in my input bucket; you need to change the bucket name as per your own input bucket. The name of the JAR file will be the same.
Enter the following arguments in the Arguments text area; the name of the main class is the first in the list:
com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint
-Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
-Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper
-Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer
-Dmongo.job.output.key=org.apache.hadoop.io.Text
-Dmongo.job.output.value=org.apache.hadoop.io.IntWritable
-Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
-Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson
-Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/
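The TopStatesMapper and TopStateReducer classes named in these arguments are the ones covered in the Writing our first Hadoop MapReduce job recipe. As a quick reminder of their shape, here is a simplified, hypothetical sketch (not the exact code from the samples; the state field name in the documents is assumed):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

// Emits (state, 1) for every postal code document read from the BSON dump.
class TopStatesMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text state = new Text();

    @Override
    protected void map(Object key, BSONObject value, Context context)
            throws IOException, InterruptedException {
        // The "state" field name is an assumption based on the recipe's output.
        state.set((String) value.get("state"));
        context.write(state, ONE);
    }
}

// Sums the 1s emitted per state, giving the number of postal codes in each state.
class TopStateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}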
Import the EMRTest project provided with the code samples into your favorite IDE. Once imported, open the com.packtpub.mongo.cookbook.AWSElasticMapReduceEntrypoint class. When expanded, the Steps section of the EMR console shows the details of the submitted job's steps.
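The entry point class uses the EMR APIs to create the cluster and submit the job programmatically. As a rough, hypothetical illustration (this is not the actual class from the samples), a job flow with a single JAR step can be submitted with the AWS SDK for Java along the following lines; the instance settings here are assumptions, and newer EMR releases may additionally require service roles and a release label:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrJobFlowSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The single step runs our assembled JAR with the same main class
        // and -D arguments that we entered in the web console.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://com.packtpub.mongo.cookbook.emr-in/mongo-hadoop-emr-assembly.jar")
                .withMainClass("com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint")
                .withArgs(
                        "-Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat",
                        "-Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper",
                        "-Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer",
                        "-Dmongo.job.output.key=org.apache.hadoop.io.Text",
                        "-Dmongo.job.output.value=org.apache.hadoop.io.IntWritable",
                        "-Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat",
                        "-Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson",
                        "-Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("Mongo Hadoop EMR test")
                // The log bucket name is an assumption; use your own log bucket.
                .withLogUri("s3://com.packtpub.mongo.cookbook.emr-logs/")
                .withSteps(new StepConfig("mongo-emr-step", jarStep)
                        .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER))
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(1)
                        .withMasterInstanceType("m1.medium") // instance type is an assumption
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}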
The part-r-00000.bson file is the one we are interested in; this file contains the results of our MapReduce job. Download it from the output bucket, and then execute the following command from the operating system shell, with a mongod instance up and running and listening on port 27017 and with the part-r-00000.bson file in the current directory:

$ mongorestore part-r-00000.bson -d test -c mongoEMRResults

Connect to the mongod instance using the mongo shell and execute the following query:

> db.mongoEMRResults.find().sort({count:-1}).limit(5)
We will see the following results for the query:
{ "_id" : "Maharashtra", "count" : 6446 } { "_id" : "Kerala", "count" : 4684 } { "_id" : "Tamil Nadu", "count" : 3784 } { "_id" : "Andhra Pradesh", "count" : 3550 } { "_id" : "Karnataka", "count" : 3204 }
Amazon EMR is a managed Hadoop service that takes care of hardware provisioning and keeps you away from the hassle of setting up your own cluster. The concepts related to our MapReduce program have already been covered in the Writing our first Hadoop MapReduce job recipe, and there is nothing more to add here. One thing that we did do was to assemble all the JARs that we need into one big fat JAR to execute our MapReduce job. This approach is fine for our small MapReduce job; for larger jobs that depend on many third-party JARs, we would instead add the JARs to the lib directory of the Hadoop installation and execute the job in the same way as we did locally. Another thing we did differently from our local setup was that we did not use a mongod instance to source the data from and write the data to; instead, we used a BSON dump file from the mongo database as the input and wrote the output to BSON files. The output dump can then be imported into a mongo database locally and the results analyzed, as we did here. It is pretty common to have data dumps uploaded to S3 buckets, so running analytics jobs on this data on the cloud using cloud infrastructure is a good option. The data in the buckets accessed by the EMR cluster need not be publicly accessible; since the EMR job runs using our account's credentials, it can read and write data/logs in our own buckets.
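To make this BSON-in/BSON-out arrangement concrete, the following sketch shows how the mongo-hadoop BSON formats plug into a standard Hadoop job. The recipe's actual entry point picks these settings up from the -Dmongo.job.* properties shown earlier; here the equivalent wiring is done directly, purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.mongodb.hadoop.BSONFileInputFormat;
import com.mongodb.hadoop.BSONFileOutputFormat;

public class BsonJobDriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "top-states-by-postal-codes");
        job.setJarByClass(BsonJobDriverSketch.class);
        job.setMapperClass(TopStatesMapper.class);
        job.setReducerClass(TopStateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read the BSON dump directly and write a BSON dump back;
        // no mongod instance is involved at either end.
        job.setInputFormatClass(BSONFileInputFormat.class);
        job.setOutputFormatClass(BSONFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the input .bson dump
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // the output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}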
After trying out this simple MapReduce job, it is highly recommended that you get to know about the Amazon EMR service and all its features. The developer's guide for EMR can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/.
There is a sample MapReduce job on the Enron dataset given as part of the mongo-hadoop connector's examples. It can be found at https://github.com/mongodb/mongo-hadoop/tree/master/examples/elastic-mapreduce. You can choose to implement this example on Amazon EMR as well, as per the given instructions.