Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce (EMR) provides on-demand managed Hadoop clusters in the Amazon Web Services (AWS) cloud for performing Hadoop MapReduce computations. EMR uses Amazon Elastic Compute Cloud (EC2) instances as the compute resources, and supports reading input data from and storing output data in Amazon Simple Storage Service (S3). EMR makes our life easier by taking care of provisioning the cloud instances, configuring the Hadoop cluster, and executing our MapReduce computational flows.

In this recipe, we are going to run the WordCount MapReduce sample (refer to the Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe from Chapter 1, Getting Hadoop up and running in a Cluster) in the Amazon EC2 cloud using Amazon Elastic MapReduce.

Getting ready

Build the required c10-samples.jar by running the Ant build in the code samples for this chapter.

How to do it...

The steps for executing the WordCount MapReduce application on Amazon Elastic MapReduce are as follows:

  1. Sign up for an AWS account by visiting http://aws.amazon.com.
  2. Open the Amazon S3 monitoring console at https://console.aws.amazon.com/s3 and sign in.
  3. Create an S3 bucket to upload the input data by clicking on Create Bucket. Provide a unique name for your bucket. Let's assume the name of the bucket is wc-input-data. You can find more information on creating an S3 bucket at http://docs.amazonwebservices.com/AmazonS3/latest/gsg/CreatingABucket.html. Several third-party desktop clients exist for Amazon S3; you can use one of those clients to manage your data in S3 as well.
  4. Upload your input data to the newly created bucket by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
  5. Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Let's assume the name of the bucket is sample-jars. Upload the c10-samples.jar file to the newly created bucket. (A sketch of performing the bucket creation and uploads programmatically using the AWS SDK for Java appears after this list.)
  6. Create an S3 bucket to store the output data of the computation. Let's assume the name of this bucket is wc-output-data. Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is c10-logs.

    Note

    The S3 bucket namespace is shared globally by all users. Hence, using the example bucket names given in this recipe might not work for you. In that case, choose your own custom names for the buckets and substitute those names in the subsequent steps of this recipe.

  7. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create New Job Flow button to create a new EMR MapReduce job flow. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow, and select the Custom Jar option from the drop-down menu below it. Click on Continue.
  8. Specify the S3 location of c10-samples.jar in the Jar Location textbox of the next tab (the Specify Parameters tab). You should specify the location of the JAR in the bucket_name/jar_name format. In the JAR Arguments textbox, enter chapter1.WordCount followed by the bucket location where you uploaded the input data and the output path. The output path should not already exist; we use a directory (wc-output-data/out1) inside the output bucket created in step 6 as the output path. You should specify the locations using the s3n://bucket_name/path format. Click on Continue.
  9. Leave the default options and click on Continue in the next tab, Configure EC2 Instances. The default options use two EC2 m1.small instances for the Hadoop slave nodes and one EC2 m1.small instance for the Hadoop master node.
  10. In the Advanced Options tab, enter the path of the S3 bucket you created above for the logs in the Amazon S3 Log Path textbox. Select Yes for Enable Debugging. Click on Continue.
  11. Click on Continue in the Bootstrap Options tab. Review your job flow in the Review tab and click on Create Job Flow to launch the instances and run the MapReduce computation. (A sketch of submitting an equivalent job flow through the EMR API appears after this list.)

    Note

    Amazon will charge you for the compute and storage resources used once you click on Create Job Flow in step 11. Refer to the Saving money using Amazon EC2 Spot Instances for EMR recipe below to find out how you can save money by using Amazon EC2 Spot Instances.

  12. Click on Refresh in the EMR console to monitor the progress of your MapReduce job. Select your job flow entry and click on Debug to view the logs and debug the computation. As EMR uploads the log files periodically, you might have to wait and refresh to access them. Check the output of the computation in the output data bucket using the AWS S3 console.
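
The bucket creation and uploads in steps 3 to 6 can also be done programmatically. The following is a minimal sketch assuming the AWS SDK for Java (version 1); the bucket names match the examples in this recipe, and the access key placeholders and file paths are illustrative, so substitute your own values.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class UploadInputToS3 {
  public static void main(String[] args) {
    // Placeholder credentials; replace with your own AWS access keys.
    BasicAWSCredentials credentials =
        new BasicAWSCredentials("ACCESS_KEY_ID", "SECRET_ACCESS_KEY");
    AmazonS3Client s3 = new AmazonS3Client(credentials);

    // Buckets from steps 3, 5, and 6. Bucket names are globally
    // unique, so substitute your own names here and in the job flow.
    s3.createBucket("wc-input-data");
    s3.createBucket("sample-jars");
    s3.createBucket("wc-output-data");
    s3.createBucket("c10-logs");

    // Upload the input text file (step 4) and the job JAR (step 5).
    s3.putObject("wc-input-data", "input.txt", new File("input.txt"));
    s3.putObject("sample-jars", "c10-samples.jar",
        new File("c10-samples.jar"));
  }
}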
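
Similarly, the job flow configured in steps 7 to 11 can be submitted through the EMR API instead of the console. The following is a minimal sketch, again assuming the AWS SDK for Java (version 1) and the example bucket names from this recipe; it illustrates the API rather than replacing the console walkthrough.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunWordCountJobFlow {
  public static void main(String[] args) {
    // Placeholder credentials; replace with your own AWS access keys.
    BasicAWSCredentials credentials =
        new BasicAWSCredentials("ACCESS_KEY_ID", "SECRET_ACCESS_KEY");
    AmazonElasticMapReduceClient emr =
        new AmazonElasticMapReduceClient(credentials);

    // The JAR location and arguments from step 8: main class, input
    // location, and a non-existing output path inside the output bucket.
    HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
        .withJar("s3n://sample-jars/c10-samples.jar")
        .withArgs("chapter1.WordCount",
                  "s3n://wc-input-data/",
                  "s3n://wc-output-data/out1");

    StepConfig step = new StepConfig()
        .withName("WordCount")
        .withHadoopJarStep(jarStep);

    // One m1.small master and two m1.small slaves (the step 9 defaults).
    JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small")
        .withInstanceCount(3);

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("WordCount-JobFlow")
        .withLogUri("s3n://c10-logs/")   // the step 10 log path
        .withSteps(step)
        .withInstances(instances);

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}

The printed job flow ID identifies your job flow in the EMR console, where you can monitor and debug it as described in step 12.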

See also

  • The Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop and the Running WordCount program in a distributed cluster environment recipes from Chapter 1, Getting Hadoop up and running in a Cluster.