Running a MapReduce job on Amazon EMR

This recipe involves running a MapReduce job on the cloud using AWS. You will need an AWS account in order to proceed; register for AWS at http://aws.amazon.com/. We will see how to run a MapReduce job on the cloud using Amazon Elastic MapReduce (EMR), a managed MapReduce service provided by Amazon on the cloud. For more details, refer to https://aws.amazon.com/elasticmapreduce/. Amazon EMR requires the data, binaries/JARs, and so on that it processes to be present in an S3 bucket, and it writes the results back to an S3 bucket. Amazon Simple Storage Service (S3) is another AWS service for data storage on the cloud; for more details on Amazon S3, refer to http://aws.amazon.com/s3/. Though we will use the mongo-hadoop connector, an interesting fact is that we won't require a MongoDB instance to be up and running. We will use a MongoDB data dump stored in an S3 bucket for our data analysis. The MapReduce program will run on the input BSON dump and generate the resulting BSON dump in the output bucket, and the logs of the MapReduce program will be written to a separate bucket dedicated to logs. The following diagram gives us an idea of what our setup will look like at a high level:


Getting ready

We will use the same Java sample for this recipe as the one we used in the Writing our first Hadoop MapReduce job recipe. To learn more about the mapper and reducer class implementations, refer to the How it works… section of that recipe. A mongo-hadoop-emr-test project is available with the code that can be downloaded from the book's website; it is used to create a MapReduce job on the cloud using the AWS EMR APIs.

To simplify things, we will upload just one JAR file to the S3 bucket to execute the MapReduce job. This JAR file will be assembled using a BAT file on Windows and a shell script on Unix-based operating systems. The mongo-hadoop-emr-test Java project has a mongo-hadoop-emr-binaries subdirectory that contains the necessary binaries along with the scripts to assemble them into one JAR file. The assembled JAR file, named mongo-hadoop-emr-assembly.jar, is also provided in the subdirectory. Running the .bat or .sh file deletes this JAR file and regenerates it, but doing so is not mandatory; the assembled JAR file that is already provided will work just fine. The Java project also contains a data subdirectory with a postalCodes.bson file in it. This is the BSON dump generated from the database that contains the postalCodes collection; the mongodump utility provided with the Mongo distribution was used to extract this dump.

How to do it…

  1. The first step of this exercise is to create a bucket on S3. You might choose to use an existing bucket; however, for this recipe, I am creating a bucket named com.packtpub.mongo.cookbook.emr-in. Remember that bucket names have to be unique across all S3 buckets; if the name is already taken, you will have to create a bucket with a different name and use it in place of com.packtpub.mongo.cookbook.emr-in throughout this recipe.

    Tip

    Do not create bucket names with an underscore (_); use a hyphen (-) instead. Bucket creation with an underscore will not fail, but the MapReduce job will fail later, as EMR doesn't accept underscores in bucket names.

  2. We will upload the assembled JAR file and the .bson data file to the newly created (or existing) S3 bucket using the AWS web console. Click on the Upload button and select the assembled JAR file and the postalCodes.bson file to be uploaded to the S3 bucket. After the upload, the contents of the bucket will look like the following screenshot:
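
    If you prefer to script this step instead of using the web console, the following is a minimal sketch that creates the bucket and uploads the two files using the AWS SDK for Java. The bucket name and object keys are the ones used in this recipe; the class name UploadEmrInput, the credential placeholders, and the local file paths are made up for illustration and will need to be adjusted:

    import java.io.File;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;

    public class UploadEmrInput {
        public static void main(String[] args) {
            // Placeholder credentials; use your own AWS access and secret keys.
            AmazonS3Client s3 = new AmazonS3Client(
                    new BasicAWSCredentials("<access key>", "<secret key>"));

            // Bucket names must be globally unique; change this if the name is taken.
            String bucket = "com.packtpub.mongo.cookbook.emr-in";
            s3.createBucket(bucket);

            // Local paths are placeholders; point them at the files in your project.
            s3.putObject(bucket, "mongo-hadoop-emr-assembly.jar",
                    new File("mongo-hadoop-emr-binaries/mongo-hadoop-emr-assembly.jar"));
            s3.putObject(bucket, "postalCodes.bson", new File("data/postalCodes.bson"));
        }
    }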

The following steps show how to initiate the EMR job from the AWS console without writing a single line of code; we will also see how to initiate the same job using the AWS Java SDK. Follow steps 1 to 11 below if you are looking to initiate the EMR job from the AWS console; follow steps 12 and 13 to start the EMR job using the Java SDK.

  1. We will first initiate a MapReduce job from the AWS console. Visit https://console.aws.amazon.com/elasticmapreduce/ and click on the Create Cluster button. In the Cluster Configuration screen, enter the details shown in the following screenshot, except for the logging bucket; you will need to select the bucket to which the logs are to be written. You can also click on the folder icon next to the textbox for the bucket name and select one of the buckets present in your account to be used as the logging bucket, as shown in the following screenshot:
  2. The Termination protection option is set to No, as this is a test instance. In the case of any error, we would prefer the instances to terminate so that they don't keep running and incurring charges.
  3. In the Software Configuration section, select the Hadoop version as 2.4.0 and AMI version as 3.1.0. Remove the additional applications by clicking on the cross next to their names, as shown in the following screenshot:
  4. In the Hardware Configuration section, select the EC2 instance type as m1.medium. This is the minimum we need to select for Hadoop version 2.4.0. The number of slave and task instances is zero. The following screenshot shows the configuration selected:
  5. In the Security and Access section, leave all the default values. We also have no need for a Bootstrap Action, so leave this as is too.
  6. The next step is to set up the steps for the MapReduce job. In the Add step drop-down menu, select the Custom JAR option and set the Auto-terminate option to Yes, as shown in the following screenshot:
  7. Now, click on the Configure and add button and enter the details.
  8. The value of the JAR S3 Location field is given as s3://com.packtpub.mongo.cookbook.emr-in/mongo-hadoop-emr-assembly.jar. This is the location in my input bucket; you need to change it to match your own input bucket. The name of the JAR file will be the same.
  9. Enter the following arguments in the Arguments text area. The name of the main class is first in the list:
    com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint
    -Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
    -Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper
    -Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer
    -Dmongo.job.output.key=org.apache.hadoop.io.Text
    -Dmongo.job.output.value=org.apache.hadoop.io.IntWritable
    -Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    -Dmapred.input.dir=s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson
    -Dmapred.output.dir=s3://com.packtpub.mongo.cookbook.emr-out/
  10. Again, the values of the final two arguments contain the input and output buckets used for my MapReduce sample; these will change according to your own input and output buckets. The value of Action on failure will be Terminate cluster. The following screenshot shows the values filled in. Click on Save after all the preceding details are entered:
  11. Now, click on the Create Cluster button. This will take some time to provision and start the cluster.
  12. In the following few steps, we will create a MapReduce job on EMR using the AWS Java API. Import the EMRTest project provided with the code samples into your favorite IDE. Once imported, open the com.packtpub.mongo.cookbook.AWSElasticMapReduceEntrypoint class.
  13. There are five constants that need to be changed in the class: the input, output, and log buckets that you will use for your example, the EC2 key name, and the AWS access and secret keys. The access key and secret key act as the username and password, respectively, when you use the AWS SDK. Change these values accordingly and run the program; on successful execution, it should give you a job ID for the newly initiated job. A minimal sketch of such a program is shown after this list.
  14. Irrespective of how you initiated the EMR job, visit the EMR console at https://console.aws.amazon.com/elasticmapreduce/ to see the status of your submitted job. The job ID you see in the second column of your initiated jobs will be the same as the job ID printed to the console when you executed the Java program (if you initiated the job that way). Click on the name of the initiated job; this should navigate you to the job details page. The hardware provisioning will take some time, and then, finally, your MapReduce step will run. Once the job is complete, its status will look like the following screenshot:
  15. When the Steps section is expanded, it will look like the following screenshot:
  16. Click on the stderr link below the Log files section to view the entire log output of the MapReduce job.
  17. Now that the MapReduce job is complete, our next step is to see its results. Visit the S3 console at https://console.aws.amazon.com/s3 and go to the output bucket that you set. In my case, the following is the content of the out bucket:

    The part-r-00000.bson file interests us. This file contains the results of our MapReduce job.

  18. Download the file to your local filesystem and import it into a locally running Mongo instance using the mongorestore utility, as follows. Note that the restore utility expects a mongod instance to be up and running and listening on port 27017, and the part-r-00000.bson file to be in the current directory:
    $ mongorestore part-r-00000.bson -d test -c mongoEMRResults
    
  19. Now, connect to the mongod instance using the Mongo shell and execute the following query:
    > db.mongoEMRResults.find().sort({count:-1}).limit(5)
    { "_id" : "Maharashtra", "count" : 6446 }
    { "_id" : "Kerala", "count" : 4684 }
    { "_id" : "Tamil Nadu", "count" : 3784 }
    { "_id" : "Andhra Pradesh", "count" : 3550 }
    { "_id" : "Karnataka", "count" : 3204 }
    
  20. The preceding command shows the top five results. If we compare these results with the ones we got in Chapter 3, Programming Language Drivers, using Mongo's MapReduce framework, and with the Writing our first Hadoop MapReduce job recipe in this chapter, we will see that the results are identical.
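
As mentioned in step 13, the following is a minimal sketch of how such a program can submit the job using the AWS SDK for Java; the EMRTest project provided with the book may differ in its details. The bucket names, JAR name, main class, and arguments are the ones used in this recipe, while the class name EmrJobSubmitter, the credential and key-pair placeholders, the log bucket name, and the commented-out IAM role names are assumptions to adjust for your own account:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class EmrJobSubmitter {
        public static void main(String[] args) {
            // Placeholders; use your own credentials, key pair, and buckets.
            String accessKey = "<access key>";
            String secretKey = "<secret key>";
            String ec2KeyName = "<EC2 key pair name>";
            String inputBucket = "s3://com.packtpub.mongo.cookbook.emr-in";
            String outputBucket = "s3://com.packtpub.mongo.cookbook.emr-out";
            String logBucket = "s3://com.packtpub.mongo.cookbook.emr-logs"; // assumed name

            // The custom JAR step, equivalent to the step configured in the console.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar(inputBucket + "/mongo-hadoop-emr-assembly.jar")
                    .withMainClass("com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint")
                    .withArgs(
                            "-Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat",
                            "-Dmongo.job.mapper=com.packtpub.mongo.cookbook.TopStatesMapper",
                            "-Dmongo.job.reducer=com.packtpub.mongo.cookbook.TopStateReducer",
                            "-Dmongo.job.output.key=org.apache.hadoop.io.Text",
                            "-Dmongo.job.output.value=org.apache.hadoop.io.IntWritable",
                            "-Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat",
                            "-Dmapred.input.dir=" + inputBucket + "/postalCodes.bson",
                            "-Dmapred.output.dir=" + outputBucket + "/");

            StepConfig step = new StepConfig()
                    .withName("Mongo Hadoop EMR step")
                    .withHadoopJarStep(jarStep)
                    .withActionOnFailure("TERMINATE_CLUSTER");

            // A single m1.medium master node; the cluster terminates once the step completes.
            JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                    .withEc2KeyName(ec2KeyName)
                    .withInstanceCount(1)
                    .withMasterInstanceType("m1.medium")
                    .withKeepJobFlowAliveWhenNoSteps(false);

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("Mongo Hadoop EMR test")
                    .withAmiVersion("3.1.0")
                    .withLogUri(logBucket)
                    .withInstances(instances)
                    .withSteps(step);
            // Depending on your account setup, you may also need
            // .withServiceRole("EMR_DefaultRole") and .withJobFlowRole("EMR_EC2_DefaultRole").

            AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                    new BasicAWSCredentials(accessKey, secretKey));
            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println("Started job flow: " + result.getJobFlowId());
        }
    }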

How it works…

Amazon EMR is a managed Hadoop service that takes care of hardware provisioning and keeps you away from the hassle of setting up your own cluster. The concepts related to our MapReduce program are already covered in the Writing our first Hadoop MapReduce job recipe, and there is nothing additional to mention here. One thing we did was to assemble the JARs we need into one big fat JAR to execute our MapReduce job. This approach is fine for our small job; for larger jobs that depend on many third-party JARs, we would have to add the JARs to the lib directory of the Hadoop installation and execute the job in the same way we did locally. Another thing we did differently from our local setup was to not use a mongod instance as the source and destination of the data; instead, we used BSON dump files from the Mongo database as the input and wrote the output to BSON files. The output dump is then imported into a local Mongo database and the results are analyzed there. It is pretty common to have data dumps uploaded to S3 buckets, so running analytics jobs on this data on the cloud using cloud infrastructure is a good option. The data accessed from the buckets by the EMR cluster need not have public access, as the EMR job runs using our account's credentials; we are therefore free to read from and write to our own buckets for data and logs.
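
To make the BSON-in/BSON-out point concrete, the following is a minimal, hypothetical sketch of how a Hadoop job can be wired to read a BSON dump and write a BSON dump using the connector's file formats, with no mongod involved. The actual entry-point class used in this recipe drives the job through the -D properties shown earlier rather than through explicit calls like these, and the class name BsonDumpJobSketch is made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.mongodb.hadoop.BSONFileInputFormat;
    import com.mongodb.hadoop.BSONFileOutputFormat;

    public class BsonDumpJobSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "Top states from BSON dump");
            job.setJarByClass(BsonDumpJobSketch.class);

            // Read a BSON dump and write a BSON dump; no mongod instance is involved.
            job.setInputFormatClass(BSONFileInputFormat.class);
            job.setOutputFormatClass(BSONFileOutputFormat.class);

            // The mapper, reducer, and key/value types from the earlier recipe.
            job.setMapperClass(com.packtpub.mongo.cookbook.TopStatesMapper.class);
            job.setReducerClass(com.packtpub.mongo.cookbook.TopStateReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // On EMR these are S3 locations; locally they could be ordinary file paths.
            FileInputFormat.addInputPath(job,
                    new Path("s3://com.packtpub.mongo.cookbook.emr-in/postalCodes.bson"));
            FileOutputFormat.setOutputPath(job,
                    new Path("s3://com.packtpub.mongo.cookbook.emr-out/"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }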

See also
