Executing a Hive script using EMR

Amazon EMR supports executing Hive queries on the data stored in S3. For more details on Hive, refer to the Installing Hive, Running SQL-style query with Hive and Performing a join with Hive recipes in Chapter 5, Hadoop Ecosystem.

In this recipe, we are going to execute the Hive queries from the Running SQL-style query with Hive recipe using Amazon EMR. This sample uses the Human Development Report data (http://hdr.undp.org/en/statistics/data/) to print the names of countries that have a gross national income per capita (GNI) greater than $2000.

How to do it...

The following steps show you how to use a Hive script with Amazon Elastic MapReduce to query a data set stored on Amazon S3.

  1. Use the Amazon S3 console to create a bucket in S3 to upload the input data. Upload the resources/hdi-data.csv file in the source package associated with this chapter to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path of the uploaded file is c10-input-data/hdi-data.csv. (Roughly equivalent AWS CLI commands for the S3 uploads and the job flow creation in these steps are sketched after this list.)
  2. Create a Hive batch script using the queries in the Running SQL-style query with Hive recipe of Chapter 5, Hadoop Ecosystem. Create a Hive table to store the result of the Select query. The Hive batch script is available in the resources/countryFilter-EMR.hive file of the resources associated with this chapter.
    CREATE TABLE HDI(
      id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch INT, gni INT
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION 's3://c10-input-data/hdi-data.csv';
    
    CREATE EXTERNAL TABLE output_countries(
        country STRING, gni INT
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '${OUTPUT}/countries';
    
    INSERT OVERWRITE TABLE output_countries
      SELECT 
        country, gni 
      FROM 
        HDI 
      WHERE 
        gni > 2000;
  3. Use the Amazon S3 console to create a bucket in S3 to upload the Hive script. Upload the resources/countryFilter-EMR.hive script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path of the uploaded file is c10-resources/countryFilter-EMR.hive.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create New Job Flow button to create a new EMR MapReduce job flow. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow. Select the Hive Program option from the drop-down menu below that. Click on Continue.
  5. Specify the S3 location of the Hive script in the Script Location textbox of the next tab (the Specify Parameters tab). You should specify the location of the script in the format bucket_name/file_name. Specify the S3 location of the uploaded input data file in the Input Location textbox. In the Output Location textbox, specify an S3 location to store the output. The output path should not already exist; we use a directory (c10-out/hive-out-1) inside the output bucket as the output path. You should specify the input and output locations using the format s3n://bucket_name/path. Click on Continue.
  6. Configure the EC2 instances for the job flow and configure the log paths for the MapReduce computations in the next two tabs. Click on Continue on the Bootstrap Options screen. Review your job flow in the Review tab and click on Create Job Flow to launch instances and to execute the Hive script. Refer to steps 9, 10, and 11 of the Running Hadoop MapReduce computations using Amazon ElasticMapReduce (EMR) recipe for more details.

    Note

    Amazon will charge you for the compute and storage resources you use once you click on Create Job Flow in step 6. Refer to the Saving money by using EC2 Spot Instances recipe to find out how you can save money by using Amazon EC2 Spot Instances.

  7. Click on Refresh in the EMR console to monitor the progress of your MapReduce job. Select your job flow entry and click on Debug to view the logs and to debug the computation. As EMR uploads the log files periodically, you might have to wait and refresh to access the log files. Check the output of the computation in the output data bucket using the AWS S3 console.
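
The preceding steps use the Amazon S3 and EMR web consoles, as this recipe describes. As an aside, roughly equivalent AWS CLI commands are sketched below. This sketch is not part of the original recipe: the bucket names follow the ones assumed in this recipe, while the EMR release label, instance type, and instance count are illustrative assumptions you should adapt to your own account.

    # Create the buckets and upload the input data and the Hive script
    # (bucket names are the ones assumed in this recipe; adapt as needed).
    > aws s3 mb s3://c10-input-data
    > aws s3 cp resources/hdi-data.csv s3://c10-input-data/hdi-data.csv
    > aws s3 mb s3://c10-resources
    > aws s3 cp resources/countryFilter-EMR.hive s3://c10-resources/countryFilter-EMR.hive
    > aws s3 mb s3://c10-out
    
    # Launch a transient cluster that runs the Hive script as a step and
    # terminates when it finishes. The -d OUTPUT=... argument populates the
    # ${OUTPUT} variable referenced in the script. Release label, instance
    # type, and instance count are illustrative assumptions.
    > aws emr create-cluster --name CountryFilterHive \
        --applications Name=Hive \
        --release-label emr-5.36.0 \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --log-uri s3://c10-out/logs/ \
        --auto-terminate \
        --steps Type=HIVE,Name=CountryFilter,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://c10-resources/countryFilter-EMR.hive,-d,OUTPUT=s3://c10-out/hive-out-1]
    
    # After the step completes, list the generated output files.
    > aws s3 ls s3://c10-out/hive-out-1/countries/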

There's more...

Amazon EMR also allows us to use Hive in interactive mode.

Starting a Hive interactive session

Let's look at the steps to start a Hive interactive session:

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create New Job Flow button to create a new EMR MapReduce job flow. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow. Select the Hive Program option from the drop-down menu below that. Click on Continue.
  2. In order to start an interactive Hive session, select the Start an Interactive Hive Session option of the Specify Parameters tab. Click on Continue.
  3. Configure the EC2 instances for the job flow in the Configure EC2 Instances tab. Click on Continue.
  4. You must select a key pair from the Amazon EC2 Key Pair drop-down box in the Advanced Options tab. In case you do not have a usable Amazon EC2 key pair, log in to the Amazon EC2 console and create a new key pair.
  5. Click on Continue in Bootstrap Options. Review your job flow in the Review tab and click on Create Job Flow to launch instances.
  6. After the cluster is provisioned, go to the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce). Select the current job flow to view more information about the job flow. Retrieve the Master Public DNS Name from the information pane. (If you need more information about this step, please refer to step 6 of the Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR recipe).
  7. Use the master public DNS name and the key file of the Amazon EC2 key pair you specified in step 4 to SSH into the master node of the cluster.
    > ssh -i <path-to-the-key-file> hadoop@<master-public-DNS>
    
  8. Start the Hive shell in the master node and issue your Hive queries.
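
For example, an interactive session can recreate the HDI table from the Running SQL-style query with Hive recipe over the same c10-input-data bucket used in the main recipe and run the filter query directly from the shell. The session below is an illustrative sketch:

    > hive
    hive> CREATE TABLE HDI(
        >   id INT, country STRING, hdi FLOAT, lifeex INT,
        >   mysch INT, eysch INT, gni INT)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        > STORED AS TEXTFILE
        > LOCATION 's3://c10-input-data/hdi-data.csv';
    hive> SELECT country, gni FROM HDI WHERE gni > 2000;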

See also

  • The Running SQL-style query with Hive recipe of Chapter 5, Hadoop Ecosystem.