EMR Bootstrap actions provide a mechanism to configure the EC2 instances before running our MapReduce computations. Examples of Bootstrap actions include providing custom configuration for Hadoop, installing dependent software, distributing a common dataset, and so on. Amazon provides a set of predefined Bootstrap actions and also allows us to write our own custom Bootstrap actions. EMR runs the Bootstrap actions on each instance before Hadoop is started.
In this recipe, we are going to use a stop words list to filter out the common words from our WordCount sample. We download the stop words list to the workers using a custom Bootstrap action.
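To make the goal concrete, the following shell sketch shows the kind of filtering a stop words list enables, run locally on a small text file rather than inside Hadoop. It is an illustration only, not the recipe's MapReduce code: the function name count_non_stop_words is hypothetical, and the sketch assumes the stop words in the file are separated by commas and/or newlines.

```shell
#!/bin/bash
# Illustrative only: count the words in a text file, skipping stop words.
# Usage: count_non_stop_words <stopwords-file> <input-file>
# Assumption: stop words are separated by commas and/or newlines.
count_non_stop_words() {
  local stopwords_file="$1" input_file="$2"
  local stop_list
  stop_list="$(mktemp)"
  # Normalize the stop words to one lowercase word per line.
  tr ',' '\n' < "$stopwords_file" \
    | tr -d ' \r' \
    | tr 'A-Z' 'a-z' \
    | sed '/^$/d' > "$stop_list"
  # Split the input into lowercase words, drop exact stop word matches,
  # and print each remaining word with its count (tab-separated).
  tr -cs "A-Za-z'" '\n' < "$input_file" \
    | tr 'A-Z' 'a-z' \
    | grep -v '^$' \
    | grep -vxFf "$stop_list" \
    | sort | uniq -c | awk '{print $2 "\t" $1}'
  rm -f "$stop_list"
}
```

Calling count_non_stop_words with the downloaded list and any text file prints one line per remaining word. In the actual recipe the same idea would be applied inside the WordCount mapper, reading the list from the directory that the Bootstrap action populates.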
The following steps show you how to download a file to all the EC2 instances of an EMR computation using a Bootstrap script.
Save the following script to a file named download-stopwords.sh. Upload the file to a bucket in Amazon S3. This custom Bootstrap script downloads a stop words list to each instance and copies it to a pre-designated directory inside the instance.
#!/bin/bash
# Abort the Bootstrap action if any command fails.
set -e
wget http://www.textfixer.com/resources/common-english-words-with-contractions.txt
mkdir -p /home/hadoop/stopwords
mv common-english-words-with-contractions.txt /home/hadoop/stopwords
Enter the S3 path of the uploaded download-stopwords.sh in the Amazon S3 Location textbox. Click on Continue.
Amazon provides us with the following predefined Bootstrap actions:
configure-daemons: This allows us to set Java Virtual Machine (JVM) options for the Hadoop daemons, such as the heap size and garbage collection behaviour.
configure-hadoop: This allows us to modify the Hadoop configuration settings. We can either upload a Hadoop configuration XML or specify individual configuration options as key-value pairs.
memory-intensive: This configures the Hadoop cluster for memory-intensive workloads.
run-if: This runs a Bootstrap action based on a property of an instance. This action can be used in scenarios where we want to run a command only on the Hadoop master node.
You can also create shutdown actions by writing scripts to a designated directory in the instance. Shutdown actions are executed when the job flow is terminated.
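A Bootstrap script might register a shutdown action along the following lines. This is a minimal sketch under stated assumptions: the function name register_shutdown_action and the script name used with it are hypothetical, and the default directory is the one given in the EMR documentation, which should be verified for the AMI version in use.

```shell
#!/bin/bash
# Sketch: write a shutdown action script into the shutdown-actions directory.
# Usage: register_shutdown_action <script-name> <command> [directory]
register_shutdown_action() {
  local name="$1" command="$2"
  # Assumed default location; confirm it in the EMR documentation.
  local dir="${3:-/mnt/var/lib/instance-controller/public/shutdown-actions}"
  mkdir -p "$dir"
  # Create a small executable script that runs the given command.
  printf '#!/bin/bash\n%s\n' "$command" > "$dir/$name"
  chmod +x "$dir/$name"
}
```

For example, a Bootstrap script could call register_shutdown_action with a script name and a command that copies results or logs to a safe location. The optional third argument exists mainly so that the function can be tried out against a scratch directory.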
Refer to http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html for more information.
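The recipe attaches the Bootstrap action through the console, but the same information can be expressed on the command line. The sketch below assumes the AWS CLI and its --bootstrap-actions option of aws emr create-cluster; the helper bootstrap_action_spec, the bucket name my-bucket, and the action names are hypothetical. The run-if location shown is the predefined action's path in the elasticmapreduce bucket, which applies only to AMI versions that ship the predefined actions.

```shell
#!/bin/bash
# Sketch: build the value for the AWS CLI's --bootstrap-actions option.
# Usage: bootstrap_action_spec <s3-path> <name> [arg ...]
bootstrap_action_spec() {
  local path="$1" name="$2"
  shift 2
  local spec="Path=${path},Name=${name}"
  if [ "$#" -gt 0 ]; then
    local joined
    # Quote each argument and join the arguments with commas.
    joined="$(printf '"%s",' "$@")"
    spec="${spec},Args=[${joined%,}]"
  fi
  printf '%s\n' "$spec"
}

# Hypothetical bucket; replace it with the bucket holding the uploaded script.
CUSTOM_ACTION="$(bootstrap_action_spec \
  s3://my-bucket/download-stopwords.sh download-stopwords)"

# run-if runs the given command only where the instance property matches.
MASTER_ONLY="$(bootstrap_action_spec \
  s3://elasticmapreduce/bootstrap-actions/run-if master-only \
  'instance.isMaster=true' 'echo running on master node')"
```

The resulting strings could then be passed as --bootstrap-actions "$CUSTOM_ACTION" "$MASTER_ONLY" alongside the other aws emr create-cluster options. Building the strings separately keeps the quoting of arguments in one place and lets the values be inspected before any cluster is launched.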