Creating an Amazon EMR job flow using the Command Line Interface

Amazon also provides a Ruby-based Command Line Interface (CLI) for EMR. The EMR CLI supports creating job flows with multiple steps as well.

This recipe creates a job flow using the EMR CLI to execute the WordCount sample from the Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR) recipe of this chapter.

How to do it...

The following steps show you how to create an EMR job flow using the EMR command line interface:

  1. Install Ruby 1.8 on your machine (an installation sketch follows this step). You can verify the version of your Ruby installation by using the following command:
    > ruby -v
    ruby 1.8...
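
    If Ruby 1.8 is not already available on your machine, you can install it through your operating system's package manager. The following is a sketch for Debian-based systems; the ruby1.8 package name is an assumption and may differ on your distribution:
    > sudo apt-get install ruby1.8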
    
  2. Create a new directory. Download the EMR Ruby CLI from http://aws.amazon.com/developertools/2264 and unzip it to the newly created directory, as sketched in the following commands.
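    For example, assuming the CLI archive was downloaded as elastic-mapreduce-ruby.zip into your Downloads folder (the archive name, directory name, and paths here are illustrative assumptions), the commands would be similar to the following:
    > mkdir emr-cli
    > cd emr-cli
    > unzip ~/Downloads/elastic-mapreduce-ruby.zip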
  3. Create an Amazon EC2 key pair by logging in to the AWS EC2 console (https://console.aws.amazon.com/ec2). In the EC2 dashboard, select a region and click on Key Pairs under the Network and Security menu. Click on the Create Key Pair button in the Key Pairs window and provide a name for the new key pair. Download and save the private key file (PEM format) in a safe location.

    Tip

    Make sure to set the appropriate file access permissions for the downloaded private key file.
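
    On Unix-like systems, you can restrict the private key file to be readable and writable only by its owner using chmod (replace the path with the location of your own PEM file):
    > chmod 600 [The path and name of your PEM file]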

  4. Save the following JSON snippet into a file named credentials.json in the directory of the extracted EMR CLI. Fill the fields using the credentials of your AWS account. A sample credentials.json file is available in the resources/emr-cli folder of the resource bundle available for this chapter.
    • You can retrieve your AWS Access Keys from the AWS console (http://console.aws.amazon.com) by clicking on Security Credentials in the context menu that appears when you click your AWS username in the upper-right corner of the console. You can also retrieve the AWS Access Keys through the Security Credentials web page link in the AWS My Account portal.
    • Provide the name of your key pair (created in step 3) as the value of the keypair property.
    • Provide the path of the saved private key file as the value of the key-pair-file property.
    • Create an S3 bucket to store the logs of the computation. Provide the S3 bucket name as the value of the log_uri property, which specifies where the logging and debugging information is stored. We assume c10-logs as the S3 bucket name for logging.
    • You can use one of us-east-1, us-west-2, us-west-1, eu-west-1, ap-northeast-1, ap-southeast-1, or sa-east-1 as the AWS region.
      {
      "access_id": "[Your AWS Access Key ID]",
      "private_key": "[Your AWS Secret Access Key]",
      "keypair": "[Your key pair name]",
      "key-pair-file": "[The path and name of your PEM file]",
      "log_uri": "s3n://c10-logs/",
      "region": "us-east-1"
      }

      Tip

      You can skip to step 8 if you have completed steps 2 to 6 of the Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR) recipe of this chapter.
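
      The CLI looks for the credentials.json file in its own directory by default; you can also point to a credentials file explicitly using the CLI's --credentials option. As a quick sanity check of your new credentials file, you can run a harmless command such as the following from the CLI directory; it should complete without an authentication error:
      > ./elastic-mapreduce --list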

  5. Create a bucket to upload the input data by clicking on Create Bucket in the Amazon S3 monitoring console (https://console.aws.amazon.com/s3). Provide a unique name for your bucket. Upload your input data to the newly created bucket by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
  6. Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Upload c10-samples.jar to the newly created bucket.
  7. Create an S3 bucket to store the output data of the computation. (If you prefer to perform steps 5 to 7 from the command line, see the sketch below.)
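    The following is a minimal command-line alternative for steps 5 to 7, using the third-party s3cmd tool (not part of the EMR CLI; it must be installed and configured with s3cmd --configure first). The bucket and file names are illustrative:
    > s3cmd mb s3://c10-input
    > s3cmd put input.txt s3://c10-input/
    > s3cmd mb s3://c10-samples
    > s3cmd put c10-samples.jar s3://c10-samples/
    > s3cmd mb s3://c10-output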
  8. Create a job flow by executing the following command inside the directory of the unzipped CLI. Replace the JAR file path, the input data location, and the output data location with the locations you used in steps 6, 5, and 7, respectively. A filled-in example with sample bucket names follows this step.
    > ./elastic-mapreduce --create --name "Hello EMR CLI" \
    --jar s3n://[S3 jar file bucket]/c10-samples.jar \
    --arg chapter1.WordCount \
    --arg s3n://[S3 input data path] \
    --arg s3n://[S3 output data path]
    

    The preceding command will create a job flow and display the job flow ID.

    Created job flow j-xxxxxx
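
    For instance, assuming hypothetical bucket names c10-samples for the JAR file, c10-input for the input data, and c10-output for the results (your bucket names will differ), the filled-in command would look like the following. Note that the output location must not already exist, as Hadoop refuses to overwrite existing output directories:

    > ./elastic-mapreduce --create --name "Hello EMR CLI" \
    --jar s3n://c10-samples/c10-samples.jar \
    --arg chapter1.WordCount \
    --arg s3n://c10-input/ \
    --arg s3n://c10-output/wc-output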
    
  9. You can use the following command to view the description of your job flow. Replace <job-flow-id> with the job flow ID displayed in step 8.
    > ./elastic-mapreduce --describe <job-flow-id>
    {
      "JobFlows": [
        {
          "SupportedProducts": [],
    ...
    
  10. You can use the following command to list and check the status of your job flows. You can also check the status of, and debug, your job flow using the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce). A variant that lists only the active job flows is shown after this step.
    > ./elastic-mapreduce --list
    j-xxxxxxx      STARTING                     Hello EMR CLI
       PENDING        Example Jar Step
    ...
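
    If your version of the CLI supports the --active option (check ./elastic-mapreduce --help), you can narrow the listing to only the job flows that are running, starting, or shutting down:
    > ./elastic-mapreduce --list --active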
    
  11. Once the job flow completes, check the results of the computation in the output data location using the S3 console.
    > ./elastic-mapreduce --list
    j-xxxxxx   COMPLETED    ec2-xxx.amazonaws.com     Hello EMR CLI
       COMPLETED      Example Jar Step
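
    Tip

    If you need to shut a job flow down before it completes (for example, to stop incurring charges after a misconfigured step), you can terminate it from the CLI. The following sketch assumes the --terminate and --jobflow options of this CLI version; replace <job-flow-id> with your job flow ID:

    > ./elastic-mapreduce --terminate --jobflow <job-flow-id>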
    

There's more...

You can use EC2 spot instances with your job flows to reduce the cost of your computations. Add a bid price to your request by appending the following options to your job flow create command:

> ./elastic-mapreduce --create --name ... \
...
--instance-group master --instance-type m1.small \
--instance-count 1 --bid-price 0.01 \
--instance-group core --instance-type m1.small \
--instance-count 2 --bid-price 0.01
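
Note that the preceding example bids for spot instances in both the master and core instance groups, which minimizes cost but puts the whole job flow at risk if the spot price exceeds your bid, since lost master or core nodes cannot be replaced transparently. A common compromise is to keep the master and core groups on regular on-demand instances (typically by omitting --bid-price for those groups) and to bid for spot instances only in an additional task instance group.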

Refer to the Saving money by using Amazon EC2 Spot Instances to execute EMR job flows recipe in this chapter for more details on Amazon Spot Instances.

See also

  • The Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR) recipe of this chapter.