Running a job on the cluster

With connectivity established, you can now execute jobs as one or more steps on your cluster. In this section, we will demonstrate how a step works using a simple example that processes a few Amazon CloudFront logs. The details of the sample data and script can be found at: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs-prepare-data-and-script.html. You can use the same techniques as a basis for creating and executing your own jobs as well:

  1. To get started with a job, from the EMR dashboard, select your cluster's name from the Cluster list page. This will bring up the newly created cluster's details page. Here, select the Steps tab.
  2. Since this is going to be our first step, go ahead and click on the Add step option. This brings up the Add step dialog, as shown in the following screenshot. Fill in the required information as described here:
    • Step type: You can choose between various options, such as Streaming program, which will prompt you to provide Mapper and Reducer function details; alternatively, you can select Hive program, Pig program, Spark program, or Custom application. In this case, we select the Hive program option.
    • Name: A suitable name for your step.
    • Script S3 location: Provide the Hive script's location here. Since we are using a predefined script, simply replace the <REGION> placeholder with your EMR cluster's operating region: s3://<REGION>.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q.
    • Input S3 location: Provide the input data file's location here. Replace the <REGION> placeholder with your EMR's operating region as done before: s3://<REGION>.elasticmapreduce.samples.
    • Output S3 location: Specify where the processed output files should be stored. In this case, we are using the custom S3 bucket that we created as a prerequisite during the EMR cluster creation. You can provide any alternative bucket as well.
    • Arguments: You can use this field to provide any optional arguments required by the script. In this case, copy and paste the following: -hiveconf hive.support.sql11.reserved.keywords=false.
    • Action on failure: You can optionally choose what EMR should do if the step's execution fails. In this case, we have selected the default Continue value.
  3. Once the required fields are filled in, click on Add to complete the process. If you would rather script the step than use the console, a minimal SDK-based sketch follows this list.
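As a point of comparison, here is a minimal sketch of the same step created through the AWS SDK for Python (boto3) rather than the console. The cluster ID (j-XXXXXXXXXXXXX), region (us-east-1), and output bucket (your-output-bucket) are placeholders you would replace with your own values; the arguments mirror the console fields described previously:

    import boto3

    # Connect to EMR in your cluster's operating region (placeholder region).
    emr = boto3.client('emr', region_name='us-east-1')

    # Add a Hive step equivalent to the console's Add step dialog.
    response = emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',       # placeholder: your cluster's ID
        Steps=[{
            'Name': 'CloudFront log analysis',   # the Name field
            'ActionOnFailure': 'CONTINUE',       # the Action on failure field
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'hive-script', '--run-hive-script', '--args',
                    # Script, input, and output S3 locations, as in the dialog:
                    '-f', 's3://us-east-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q',
                    '-d', 'INPUT=s3://us-east-1.elasticmapreduce.samples',
                    '-d', 'OUTPUT=s3://your-output-bucket/',  # placeholder bucket
                    # The optional argument from the Arguments field:
                    '-hiveconf', 'hive.support.sql11.reserved.keywords=false',
                ],
            },
        }],
    )
    print(response['StepIds'][0])    # note the step ID for status checks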

The step now starts executing the supplied script on the EMR cluster. You can track its progress by watching the step's status change from Pending to Running to Completed, as shown in the following screenshot:
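If you are working with the SDK instead, the same status transitions can be polled with boto3's describe_step call; the cluster and step IDs below are placeholders:

    import time
    import boto3

    emr = boto3.client('emr', region_name='us-east-1')

    # Poll the step until it reaches a terminal state (placeholder IDs).
    while True:
        step = emr.describe_step(ClusterId='j-XXXXXXXXXXXXX',
                                 StepId='s-XXXXXXXXXXXXX')
        state = step['Step']['Status']['State']
        print(state)                      # PENDING -> RUNNING -> COMPLETED
        if state in ('COMPLETED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(30)                    # check every 30 seconds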

Once the job completes its execution, head back to your Amazon S3 output bucket and view the results of the processing. In this case, the output contains the number of access requests made to CloudFront, grouped by operating system.
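You can also list the generated output files programmatically; here is a short sketch assuming your-output-bucket is the bucket supplied in the Output S3 location field:

    import boto3

    s3 = boto3.client('s3')

    # List whatever the Hive script wrote to the output bucket.
    listing = s3.list_objects_v2(Bucket='your-output-bucket')
    for obj in listing.get('Contents', []):
        print(obj['Key'], obj['Size'])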
