Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs

EMR Bootstrap actions provide a mechanism to configure the EC2 instances before running our MapReduce computations. Examples of Bootstrap actions include providing custom configuration for Hadoop, installing dependent software, distributing a common dataset, and so on. Amazon provides a set of predefined Bootstrap actions and also allows us to write our own custom Bootstrap actions. EMR runs the Bootstrap actions on each instance before Hadoop is started.

In this recipe, we are going to use a stop words list to filter out the common words from our WordCount sample. We download the stop words list to the workers using a custom Bootstrap action.
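The filtering idea itself can be tried locally before it is wired into the WordCount job. The following is a minimal sketch, not the recipe's MapReduce code: it assumes the stop words file is a comma-separated list (the format is an assumption here), and filter_stopwords is a hypothetical helper name.

```shell
#!/bin/sh
# filter_stopwords STOPWORDS_FILE
# Reads words from standard input (one per line) and prints only the words
# that do not appear in STOPWORDS_FILE. The comparison ignores case.
# Assumption: STOPWORDS_FILE holds comma-separated words.
filter_stopwords() {
  awk -v list="$1" '
    BEGIN {
      # Load every comma-separated entry of the list into a lookup table
      while ((getline line < list) > 0) {
        n = split(line, parts, ",")
        for (i = 1; i <= n; i++) stop[tolower(parts[i])] = 1
      }
    }
    # Print the input word only if it is not in the lookup table
    !(tolower($0) in stop)
  '
}
```

For example, `tr -cs 'A-Za-z' '\n' < input.txt | filter_stopwords stopwords.txt` would split a hypothetical input.txt into words and drop the common ones.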

How to do it...

The following steps show you how to download a file to all the EC2 instances of an EMR computation using a Bootstrap script.

  1. Save the following script to a file named download-stopwords.sh. Upload the file to a bucket in Amazon S3. This custom Bootstrap script downloads a stop words list to each instance and copies it to a pre-designated directory inside the instance.
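    #!/bin/bash
    # Abort the Bootstrap action if any command fails
    set -e
    # Download the stop words list to the current directory
    wget http://www.textfixer.com/resources/common-english-words-with-contractions.txt
    # Create the pre-designated directory and move the list into it
    mkdir -p /home/hadoop/stopwords
    mv common-english-words-with-contractions.txt /home/hadoop/stopwords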
  2. Complete steps 1 to 10 of the Running Hadoop MapReduce computations using Amazon ElasticMapReduce (EMR) recipe in this chapter.
  3. Select the Configure your Bootstrap Actions option in the Bootstrap Options tab. Select Custom Action in the Action Type drop-down box. Give a name to your action in the Name textbox and provide the S3 path of the location where you uploaded download-stopwords.sh in the Amazon S3 Location textbox. Click on Continue.
  4. Review your job flow in the Review tab and click on Create Job Flow to launch instances and to run the MapReduce computation.
  5. Click on Refresh in the EMR console to monitor the progress of your MapReduce job. Select your job flow entry and click on Debug to view the logs and to debug the computation.
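If the Bootstrap script may run more than once, or if the download can fail midway, download-stopwords.sh could be made a little more defensive. The following is a sketch under assumptions rather than the recipe's script: the function names fetch and install_stopwords are hypothetical, and the URL and target directory are passed as arguments instead of being hardcoded.

```shell
#!/bin/bash
# Hypothetical, more defensive variant of download-stopwords.sh.
set -eu

# fetch URL OUTPUT_FILE
# Downloads URL into OUTPUT_FILE. Kept separate so that it can be replaced
# with a stub when trying the script locally.
fetch() {
  wget -q -O "$2" "$1"
}

# install_stopwords URL DEST_DIR
# Downloads the file named by URL into DEST_DIR unless a non-empty copy
# is already there.
install_stopwords() {
  local url="$1"
  local dest_dir="$2"
  local target
  target="$dest_dir/$(basename "$url")"
  mkdir -p "$dest_dir"
  if [ -s "$target" ]; then
    echo "already present: $target"
    return 0
  fi
  # Download under a temporary name so that a failed transfer
  # never leaves a partial file at the final location
  fetch "$url" "$target.tmp"
  mv "$target.tmp" "$target"
  echo "installed: $target"
}

# Run the installation only when both arguments are supplied
if [ "$#" -eq 2 ]; then
  install_stopwords "$1" "$2"
fi
```

With this variant, the stop words URL and the destination directory would be supplied as arguments to the custom Bootstrap action.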

There's more...

Amazon provides us with the following predefined Bootstrap actions:

  • configure-daemons: This allows us to set Java Virtual Machine (JVM) options for the Hadoop daemons, such as the heap size and garbage collection behavior.
  • configure-hadoop: This allows us to modify the Hadoop configuration settings. We can either upload a Hadoop configuration XML or we can specify individual configuration options as key-value pairs.
  • memory-intensive: This configures the Hadoop cluster for memory-intensive workloads.
  • run-if: This runs a Bootstrap action based on a property of an instance. This action can be used in scenarios where we want to run a command only on the Hadoop master node.
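A custom Bootstrap script can also make the master-only decision itself. The sketch below assumes that the instance describes itself in a JSON file at /mnt/var/lib/info/instance.json containing an isMaster field; treat that path and field name as assumptions to verify against the EMR documentation. is_master is a hypothetical helper name.

```shell
#!/bin/sh
# is_master [INFO_FILE]
# Succeeds (exit status 0) if INFO_FILE marks this instance as the master node.
# Assumption: the instance metadata is a JSON file with an "isMaster" field,
# located by default at /mnt/var/lib/info/instance.json.
is_master() {
  grep -Eq '"isMaster"[[:space:]]*:[[:space:]]*true' \
    "${1:-/mnt/var/lib/info/instance.json}"
}
```

A script could then guard a command with `if is_master; then ...; fi` so that it runs only on the master node.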

You can also create shutdown actions by writing scripts to a designated directory in the instance. Shutdown actions are executed after the job flow is terminated.
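As a rough sketch, a Bootstrap script could register a shutdown action by writing an executable script into that directory. The helper name register_shutdown_action and the cleanup command are hypothetical, and the directory is passed as an argument because its exact location should be taken from the EMR documentation (/mnt/var/lib/instance-controller/public/shutdown-actions/ is assumed in the usage note).

```shell
#!/bin/bash
# register_shutdown_action DIR NAME
# Writes a small executable script called NAME into DIR.
register_shutdown_action() {
  local dir="$1"
  local name="$2"
  mkdir -p "$dir"
  # The quoted heredoc delimiter keeps the script body from being expanded now
  cat > "$dir/$name" <<'EOF'
#!/bin/bash
# Hypothetical shutdown step: record when the shutdown action ran
date > /tmp/shutdown-action-ran
EOF
  chmod +x "$dir/$name"
}
```

A call such as `register_shutdown_action /mnt/var/lib/instance-controller/public/shutdown-actions record-shutdown.sh` would then install the action, assuming that is the designated directory on the instance.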

Refer to http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html for more information.
