Executing a Pig script using EMR

Amazon EMR supports executing Pig scripts on the data stored in S3. For more details on Pig, refer to the Installing Pig and Running your first Pig commands recipes in Chapter 5, Hadoop Ecosystem.

In this recipe, we are going to execute the Pig script sample from the Running your first Pig commands recipe using Amazon EMR. This sample uses the Human Development Report data (http://hdr.undp.org/en/statistics/data/) to print the names of the countries that have a gross national income (GNI) per capita greater than 2000, sorted by GNI.

How to do it...

The following steps show you how to use a Pig script with Amazon Elastic MapReduce to process a dataset stored on Amazon S3.

  1. Use the Amazon S3 console to create a bucket in S3 and upload the input data. Upload the resources/hdi-data.csv file from the source package associated with this chapter to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path of the uploaded file to be c10-input-data/hdi-data.csv.
  2. Modify the Pig script from the Running your first Pig commands recipe of Chapter 5, Hadoop Ecosystem, so that it can run using EMR. Add a STORE command to save the result to the filesystem. Parameterize the LOAD command by using $INPUT as the input file and the STORE command by using $OUTPUT as the output directory. The modified Pig script is available in the resources/countryFilter-EMR.pig file of the resources associated with this chapter (see the sketch after this list for a way to test the parameterized script locally):
    A = LOAD '$INPUT' USING PigStorage(',') AS
        (id:int, country:chararray, hdi:float, lifeex:int,
        mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;
    C = ORDER B BY gni;
    STORE C INTO '$OUTPUT';
  3. Use the Amazon S3 console to create a bucket in S3 and upload the Pig script. Upload the resources/countryFilter-EMR.pig script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path of the uploaded script to be c10-resources/countryFilter-EMR.pig.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create New Job Flow button to create a new EMR MapReduce job flow. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow. Select the Pig Program option from the drop-down menu below that. Click on Continue.
  5. Specify the S3 location of the Pig script in the Script Location textbox of the next tab (the Specify Parameters tab). You should specify the location of the script in the bucket_name/file_name format. Specify the S3 location of the uploaded input data file in the Input Location textbox. In the Output Location textbox, specify an S3 location to store the output. The output path should not already exist; we use a directory (c10-out/out4) inside the output bucket as the output path. You should specify the input and output locations using the s3n://bucket_name/path format. Click on Continue.
  6. Configure the EC2 instances for the job flow and configure the log paths for the MapReduce computations in the next two tabs. Click on Continue on the Bootstrap Options screen. Review your job flow in the Review tab and click on Create Job Flow to launch the instances and execute the Pig script. Refer to steps 9, 10, and 11 of the Running Hadoop MapReduce computations using Amazon ElasticMapReduce (EMR) recipe for more details.

    Note

    Amazon will charge you for the compute and storage resources used when you click on Create Job Flow in the previous step. Refer to the Saving money by using EC2 Spot Instances recipe to find out how you can save money by using Amazon EC2 Spot instances.

  7. Click on Refresh in the EMR console to monitor the progress of your MapReduce job. Select your job flow entry and click on Debug to view the logs and to debug the computation. As EMR uploads the logfiles periodically, you might have to wait and refresh to access them. Check the output of the computation in the output data bucket using the AWS S3 console.
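
If you have Pig installed locally (refer to the Installing Pig recipe in Chapter 5, Hadoop Ecosystem), you may want to test the parameterized script on your own machine before uploading it to S3. The following commands are a minimal sketch using Pig's local mode; the input file path (hdi-data.csv) and the output directory (pig-local-out) are placeholder names that you should replace with your own paths:

    > pig -x local -param INPUT=hdi-data.csv -param OUTPUT=pig-local-out countryFilter-EMR.pig
    > cat pig-local-out/part-*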

There's more...

Amazon EMR also allows us to use Apache Pig in interactive mode.

Starting a Pig interactive session

Let's look at the steps to start a Pig interactive session:

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create New Job Flow button to create a new EMR MapReduce job flow. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow. Select the Pig Program option from the drop-down menu below that. Click on Continue.
  2. To start an interactive Pig session, select the Start an Interactive Pig Session option on the Specify Parameters tab. Click on Continue.
  3. Configure the EC2 instances for the job flow in the Configure EC2 Instances tab. Click on Continue.
  4. You must select a key pair from the Amazon EC2 Key Pair drop-down box in the Advanced Options tab. In case you do not have a usable Amazon EC2 key pair, log in to the Amazon EC2 console and create a new key pair.
  5. Click on Continue on the Bootstrap Options screen. Review your job flow in the Review tab and click on Create Job Flow to launch instances.
  6. After the cluster is provisioned, go to the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce). Select the current job flow to view more information about the job flow. Retrieve the Master Public DNS Name value from the information pane. (If you need more information about this step, please refer to step 6 of the Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR recipe).
  7. Use the master public DNS name and the key file of the Amazon EC2 key pair you specified in step 4 to SSH in to the master node of the cluster:
    > ssh -i <path-to-the-key-file> hadoop@<master-public-DNS>
    
  8. Start the Pig interactive Grunt shell on the master node and issue your Pig commands, as shown in the sketch below.
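
For example, the following Grunt session reproduces the country-filtering computation interactively. This is a minimal sketch; it assumes the input data was uploaded to the c10-input-data bucket as described earlier and that the cluster can read the file from S3 using the s3n:// scheme:

    > pig
    grunt> A = LOAD 's3n://c10-input-data/hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
    grunt> B = FILTER A BY gni > 2000;
    grunt> C = ORDER B BY gni;
    grunt> DUMP C;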

See also

  • The Running your first Pig commands recipe in Chapter 5, Hadoop Ecosystem.