This recipe involves running the MapReduce job on the cloud using AWS. You will need an AWS account in order to proceed; register for AWS at http://aws.amazon.com/. We will see how to run a MapReduce job on the cloud using Amazon Elastic MapReduce (EMR), a managed MapReduce service provided by Amazon on the cloud. For more details, refer to https://aws.amazon.com/elasticmapreduce/. Amazon EMR requires the data, binaries/JARs, and so on that it processes to be present in an S3 bucket, and it writes the results back to an S3 bucket. Amazon Simple Storage Service (S3) is another AWS service for data storage on the cloud; for more details on Amazon S3, refer to http://aws.amazon.com/s3/. Though we will use the mongo-hadoop connector, an interesting fact is that we won't require a MongoDB instance to be up and running. We will instead use a MongoDB data dump stored in an S3 bucket for our data analysis. The MapReduce program will run on the input BSON dump and generate the resulting BSON dump in the output bucket, while the logs of the MapReduce program will be written to another bucket dedicated to logs. The following diagram gives us an idea of how our setup looks at a high level:
We will use the same Java sample for this recipe as the one we used in the Writing our first Hadoop MapReduce job recipe. To know more about the mapper and reducer class implementations, refer to the How it works… section of that recipe. We have a mongo-hadoop-emr-test project available with the code that can be downloaded from the book's website; this code is used to create a MapReduce job on the cloud using the AWS EMR APIs.
To simplify things, we will upload just one JAR file to the S3 bucket to execute the MapReduce job. This JAR file will be assembled using a BAT file on Windows or a shell script on Unix-based operating systems. The mongo-hadoop-emr-test Java project has a mongo-hadoop-emr-binaries subdirectory that contains the necessary binaries along with the scripts to assemble them into one JAR file. The assembled JAR file, named mongo-hadoop-emr-assembly.jar, is also provided in that subdirectory. Running the .bat or .sh file deletes this JAR file and regenerates it, but doing so is not mandatory; the assembled JAR file that is already provided will work just fine.

The Java project also contains a data subdirectory with a postalCodes.bson file in it. This is the BSON dump generated from the database that contains the postalCodes collection; the mongodump utility provided with the MongoDB distribution is used to extract this dump.
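For reference, a dump like this can be produced with the mongodump utility. Assuming the postalCodes collection lives in the test database of a locally running mongod instance (the database name here is an assumption, not something stated in this recipe), a command along the following lines writes the dump to dump/test/postalCodes.bson:

$ mongodump --db test --collection postalCodes --out dump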
The first step is to create an S3 bucket for the input, named com.packtpub.mongo.cookbook.emr-in. Remember that bucket names have to be unique across all S3 buckets, so you will not be able to create a bucket with this very name; create one with a different name and use it in place of com.packtpub.mongo.cookbook.emr-in wherever that name is used in this recipe. Next, upload the assembled JAR file and the .bson file for the data to the newly created (or existing) S3 bucket. To upload the files, we will use the AWS web console. Click on the Upload button and select the assembled JAR file and the postalCodes.bson file to be uploaded to the S3 bucket.
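As an aside, if you prefer to script this upload rather than click through the console, the same two files can be pushed with the AWS SDK for Java. The following is a minimal sketch, not part of the book's code; the credentials and local file paths are placeholders/assumptions:

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class UploadEmrInputsToS3 {
    public static void main(String[] args) {
        // Placeholder credentials of the AWS account that owns the bucket.
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials("<access key>", "<secret key>"));

        // The input bucket created in the previous step.
        String bucket = "com.packtpub.mongo.cookbook.emr-in";

        // Upload the assembled JAR and the BSON data dump (local paths assumed).
        s3.putObject(bucket, "mongo-hadoop-emr-assembly.jar",
                new File("mongo-hadoop-emr-binaries/mongo-hadoop-emr-assembly.jar"));
        s3.putObject(bucket, "postalCodes.bson",
                new File("data/postalCodes.bson"));
    }
}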
After the upload, the contents of the bucket will look like the following screenshot:

The following steps initiate the EMR job from the AWS console without writing a single line of code; we will also see how to initiate the same job using the AWS Java SDK. Follow steps 4 to 9 if you are looking to initiate the EMR job from the AWS console. Follow steps 10 and 11 to start the EMR job using the Java SDK.
com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint -Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat -Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper -Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer -Dmongo.job.output.key=org.apache.hadoop.io.Text -Dmongo.job.output.value=org.apache.hadoop.io.IntWritable -Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat -Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson -Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/
Import the EMRTest project provided with the code samples into your favorite IDE. Once imported, open the com.packtpub.mongo.cookbook.AWSElasticMapReduceEntrypoint class.
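This class drives the EMR job through the AWS SDK for Java. Its exact contents ship with the book's code; purely as an illustration of the shape such an entry point takes, a minimal sketch using the EMR API might look like the following, where the access keys, instance types, AMI version, and log bucket name are placeholders/assumptions rather than values from the recipe:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class ElasticMapReduceEntrypointSketch {
    public static void main(String[] args) {
        // Placeholder credentials of the AWS account that owns the S3 buckets.
        BasicAWSCredentials credentials =
                new BasicAWSCredentials("<access key>", "<secret key>");

        // The Hadoop step: the assembled JAR uploaded to the input bucket, the driver
        // class, and the same -D arguments that are pasted into the EMR console.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://com.packtpub.mongo.cookbook.emr-in/mongo-hadoop-emr-assembly.jar")
                .withMainClass("com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint")
                .withArgs(
                        "-Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat",
                        "-Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper",
                        "-Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer",
                        "-Dmongo.job.output.key=org.apache.hadoop.io.Text",
                        "-Dmongo.job.output.value=org.apache.hadoop.io.IntWritable",
                        "-Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat",
                        "-Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson",
                        "-Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/");

        StepConfig step = new StepConfig("MongoHadoopEMRTest", jarStep)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        // A small, throwaway cluster: one master and one core node (instance types assumed).
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(2)
                .withMasterInstanceType("m1.medium")
                .withSlaveInstanceType("m1.medium")
                .withKeepJobFlowAliveWhenNoSteps(false);

        RunJobFlowRequest request = new RunJobFlowRequest("MongoHadoopEMRTest", instances)
                .withAmiVersion("3.1.0")                                   // assumed AMI version
                .withLogUri("s3://com.packtpub.mongo.cookbook.emr-logs/")  // assumed log bucket
                .withSteps(step);

        RunJobFlowResult result =
                new AmazonElasticMapReduceClient(credentials).runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}

Running such a class with valid credentials and bucket names submits the job flow, whose progress can then be tracked from the EMR console.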
Once the job completes, go to the S3 console at https://console.aws.amazon.com/s3 and visit the out bucket we set for the job. In my case, the following is the content of the out bucket:

The part-r-00000.bson file interests us; this file contains the results of our MapReduce job.
Download the part-r-00000.bson file from the out bucket and restore its contents to a local MongoDB database using the mongorestore utility as follows. Note that the restore utility for the following command expects a mongod instance to be up and running and listening on port 27017, with the part-r-00000.bson file in the current directory:

$ mongorestore part-r-00000.bson -d test -c mongoEMRResults
Connect to the running mongod instance using the Mongo shell and execute the following query:

> db.mongoEMRResults.find().sort({count:-1}).limit(5)
{ "_id" : "Maharashtra", "count" : 6446 }
{ "_id" : "Kerala", "count" : 4684 }
{ "_id" : "Tamil Nadu", "count" : 3784 }
{ "_id" : "Andhra Pradesh", "count" : 3550 }
{ "_id" : "Karnataka", "count" : 3204 }
Amazon EMR is a managed Hadoop service that takes care of hardware provisioning and keeps you away from the hassle of setting up your own cluster. The concepts related to our MapReduce program are already covered in the Writing our first Hadoop MapReduce job recipe, and there is nothing additional to mention here. One thing we did do was assemble the JARs we need into one big, fat JAR to execute our MapReduce job. This approach is fine for our small MapReduce job; for larger jobs that depend on a lot of third-party JARs, we would have to go for an approach where we add the JARs to the lib directory of the Hadoop installation and execute the job the same way we did for the MapReduce job that we executed locally. Another thing we did differently from our local setup was not using a mongod instance to source and write the data; instead, we used a BSON dump from the Mongo database as the input and wrote the output to BSON files. The output dump is then imported into a local Mongo database and the results are analyzed. It is pretty common to have data dumps uploaded to S3 buckets, so running analytics jobs on such data on the cloud using cloud infrastructure is a good option. The data accessed from the buckets by the EMR cluster need not have public access, as the EMR job runs using our account's credentials, and so we can read and write data/logs in our own buckets.
The mongo-hadoop connector's examples also include a sample that uses the enron dataset. You might choose to implement this example on Amazon EMR as per the given instructions.