This recipe involves running a MapReduce job on the cloud using AWS. You will need an AWS account in order to proceed; register with AWS at http://aws.amazon.com/. We will see how to run a MapReduce job on the cloud using Amazon Elastic MapReduce (Amazon EMR), a managed MapReduce service provided by Amazon on the cloud. Refer to https://aws.amazon.com/elasticmapreduce/ for more details. Amazon EMR consumes data, binaries/JARs, and so on from an AWS S3 bucket, processes them, and writes the results back to an S3 bucket. Amazon Simple Storage Service (Amazon S3) is another AWS service for data storage on the cloud; refer to http://aws.amazon.com/s3/ for more details on Amazon S3.

Though we will use the mongo-hadoop connector, an interesting fact is that we won't require a MongoDB instance to be up and running. We will use a MongoDB data dump stored in an S3 bucket for our data analysis. The MapReduce program will run on the input BSON dump and generate the result BSON dump in the output bucket, while the logs of the MapReduce program will be written to another bucket dedicated to logs. The following figure gives us an idea of how our setup looks at a high level:
We will use the same Java sample as we did in the Writing our first Hadoop MapReduce job recipe. To know more about the mapper and reducer class implementations, you can refer to the How it works section of the same recipe. We have a mongo-hadoop-emr-test project available with the code that can be downloaded from the Packt website, which is used to create a MapReduce job on the cloud using the AWS EMR APIs. To simplify things, we will upload just one JAR to the S3 bucket to execute the MapReduce job. This JAR will be assembled using a BAT file on Windows and a shell script on Unix-based operating systems. The mongo-hadoop-emr-test Java project has the mongo-hadoop-emr-binaries subdirectory containing the necessary binaries along with the scripts to assemble them into one JAR. The assembled mongo-hadoop-emr-assembly.jar file is also provided in the subdirectory; running the .bat or .sh file deletes this JAR and regenerates it, but doing so is not mandatory, as the assembled JAR that is already provided will work just fine. The Java project also contains a data subdirectory with a postalCodes.bson file in it. This is the BSON dump generated from the database containing the postalCodes collection; the mongodump utility provided with the mongo distribution was used to extract this dump.
Create an S3 bucket named com.packtpub.mongo.cookbook.emr-in to hold the input JAR and data. Remember that the name of a bucket has to be unique across all S3 buckets, so you will not be able to create a bucket with this very name; you will have to create one with a different name and use it in place of com.packtpub.mongo.cookbook.emr-in wherever that name is used in this recipe.

Upload the assembled JAR file and the .bson file for the data to the newly created (or existing) S3 bucket. To upload the files, we will use the AWS web console. Click on the Upload button and select the assembled JAR file and the postalCodes.bson file to be uploaded to the S3 bucket. After uploading, the bucket should contain both of these files.
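If you prefer to script the bucket creation and upload rather than use the web console, the same can be done with the AWS SDK for Java. The following is a minimal, hypothetical sketch (it is not part of the recipe's code samples, and it assumes the SDK picks up your credentials from the default provider chain):

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadInputToS3 {
    public static void main(String[] args) {
        // Credentials come from the default provider chain
        // (environment variables, ~/.aws/credentials, and so on).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Use your own, globally unique bucket name here.
        String inputBucket = "com.packtpub.mongo.cookbook.emr-in";
        if (!s3.doesBucketExistV2(inputBucket)) {
            s3.createBucket(inputBucket);
        }

        // Upload the assembled JAR and the BSON data dump; the local paths
        // assume the program is run from the project's root directory.
        s3.putObject(inputBucket, "mongo-hadoop-emr-assembly.jar",
                new File("mongo-hadoop-emr-binaries/mongo-hadoop-emr-assembly.jar"));
        s3.putObject(inputBucket, "postalCodes.bson", new File("data/postalCodes.bson"));
    }
}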
Now click on the Configure and Add button and enter the details. The value of the JAR S3 Location is given as s3://com.packtpub.mongo.cookbook.emr-in/mongo-hadoop-emr-assembly.jar. This is the location in my input bucket; you need to change the bucket name as per your own input bucket. The name of the JAR file will be the same.
Enter the following arguments in the Arguments text area; the name of the main class is the first in the list:
com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint
-Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
-Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper
-Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer
-Dmongo.job.output.key=org.apache.hadoop.io.Text
-Dmongo.job.output.value=org.apache.hadoop.io.IntWritable
-Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
-Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson
-Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/
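The TopStatesMapper and TopStateReducer classes named in these arguments are the ones covered in the Writing our first Hadoop MapReduce job recipe. As a quick reminder of their shape, here is a simplified, hypothetical sketch (not the exact code from the samples; the state field name in the documents is assumed):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

// Emits (state, 1) for every postal code document read from the BSON dump.
class TopStatesMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text state = new Text();

    @Override
    protected void map(Object key, BSONObject value, Context context)
            throws IOException, InterruptedException {
        // The "state" field name is an assumption based on the recipe's output.
        state.set((String) value.get("state"));
        context.write(state, ONE);
    }
}

// Sums the 1s emitted per state, giving the number of postal codes in each state.
class TopStateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}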
Import the EMRTest project provided with the code samples into your favorite IDE. Once imported, open the com.packtpub.mongo.cookbook.AWSElasticMapReduceEntrypoint class. When expanded, the Steps section of the EMR console shows the details of the submitted job's steps.
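The entry point class uses the EMR APIs to create the cluster and submit the job programmatically. As a rough, hypothetical illustration (this is not the actual class from the samples), a job flow with a single JAR step can be submitted with the AWS SDK for Java along the following lines; the instance settings here are assumptions, and newer EMR releases may additionally require service roles and a release label:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrJobFlowSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The single step runs our assembled JAR with the same main class
        // and -D arguments that we entered in the web console.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://com.packtpub.mongo.cookbook.emr-in/mongo-hadoop-emr-assembly.jar")
                .withMainClass("com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint")
                .withArgs(
                        "-Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat",
                        "-Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper",
                        "-Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer",
                        "-Dmongo.job.output.key=org.apache.hadoop.io.Text",
                        "-Dmongo.job.output.value=org.apache.hadoop.io.IntWritable",
                        "-Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat",
                        "-Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson",
                        "-Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("Mongo Hadoop EMR test")
                // The log bucket name is an assumption; use your own log bucket.
                .withLogUri("s3://com.packtpub.mongo.cookbook.emr-logs/")
                .withSteps(new StepConfig("mongo-emr-step", jarStep)
                        .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER))
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(1)
                        .withMasterInstanceType("m1.medium") // instance type is an assumption
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}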
The part-r-00000.bson file is the one we are interested in; this file contains the results of our MapReduce job. Download it from the output bucket, and then execute the following command from the operating system shell, with a mongod instance up and running and listening on port 27017 and with the part-r-00000.bson file in the current directory:

$ mongorestore part-r-00000.bson -d test -c mongoEMRResults

Connect to the mongod instance using the mongo shell and execute the following query:

> db.mongoEMRResults.find().sort({count:-1}).limit(5)
We will see the following results for the query:
{ "_id" : "Maharashtra", "count" : 6446 } { "_id" : "Kerala", "count" : 4684 } { "_id" : "Tamil Nadu", "count" : 3784 } { "_id" : "Andhra Pradesh", "count" : 3550 } { "_id" : "Karnataka", "count" : 3204 }
Amazon EMR is a managed Hadoop service that takes care of hardware provisioning and keeps you away from the hassle of setting up your own cluster. The concepts related to our MapReduce program have already been covered in the Writing our first Hadoop MapReduce job recipe, and there is nothing more to add here. One thing that we did do was to assemble all the JARs that we need into one big fat JAR to execute our MapReduce job. This approach is fine for our small MapReduce job; for larger jobs that depend on many third-party JARs, we would instead add the JARs to the lib directory of the Hadoop installation and execute the job in the same way as we did locally. Another thing we did differently from our local setup was that we did not use a mongod instance to source the data from and write the data to; instead, we used a BSON dump file from the mongo database as the input and wrote the output to BSON files. The output dump can then be imported into a mongo database locally and the results analyzed, as we did here. It is pretty common to have data dumps uploaded to S3 buckets, so running analytics jobs on this data on the cloud using cloud infrastructure is a good option. The data in the buckets accessed by the EMR cluster need not be publicly accessible; since the EMR job runs using our account's credentials, it can read and write data/logs in our own buckets.
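To make this BSON-in/BSON-out arrangement concrete, the following sketch shows how the mongo-hadoop BSON formats plug into a standard Hadoop job. The recipe's actual entry point picks these settings up from the -Dmongo.job.* properties shown earlier; here the equivalent wiring is done directly, purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.mongodb.hadoop.BSONFileInputFormat;
import com.mongodb.hadoop.BSONFileOutputFormat;

public class BsonJobDriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "top-states-by-postal-codes");
        job.setJarByClass(BsonJobDriverSketch.class);
        job.setMapperClass(TopStatesMapper.class);
        job.setReducerClass(TopStateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read the BSON dump directly and write a BSON dump back;
        // no mongod instance is involved at either end.
        job.setInputFormatClass(BSONFileInputFormat.class);
        job.setOutputFormatClass(BSONFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the input .bson dump
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // the output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}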
After trying out this simple MapReduce job, it is highly recommended that you get to know about the Amazon EMR service and all its features. The developer's guide for EMR can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/.
There is a sample MapReduce job on the Enron dataset given as part of the mongo-hadoop connector's examples. It can be found at https://github.com/mongodb/mongo-hadoop/tree/master/examples/elastic-mapreduce. You can choose to implement this example on Amazon EMR as well, as per the given instructions.