Running a SQL-style query with Hive

This recipe explains how you can use Hive to perform data processing operations using its SQL-style language.

In this recipe, we will use a dataset from the Human Development Report (HDR), which describes countries based on several human development measures. You can find the dataset at http://hdr.undp.org/en/statistics/data/.

Getting ready

For this recipe, you need a working Hive installation. If you have not done so already, please follow the previous recipe to install Hive.

How to do it...

This section demonstrates how to use Hive for filtering and sorting data.

  1. Copy the resources/chapter5/hdi-data.csv file to the HIVE_HOME directory.
  2. Start Hive by changing the directory to HIVE_HOME and running the following command:
    >bin/hive
    
  3. Let's first define a table to hold the data by running the following Hive command.

    Note

    The table definition only creates the table layout; it does not put any data into the table.

    hive> CREATE TABLE HDI(id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch INT, gni INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
    
    OK
    Time taken: 11.719 seconds
    
  4. Let's use the LOAD command to load the data into the table. It is worth noting that the LOAD command copies the file, without any alteration, into the table's storage location as defined by the table definition. Hive then uses the formats declared in the table definition to parse the data when the table is read. For example, the table definition in step 3 defines a table HDI that stores its data as a text file with fields terminated by ',' (CSV format), so the input we provide to the LOAD command must follow the CSV format as per the table definition.
    hive> LOAD DATA LOCAL INPATH 'hdi-data.csv' INTO TABLE HDI;
    Copying data from file:/Users/srinath/playground/hadoop-book/hive-0.9.0/hdi-data.csv
    Copying file: file:/Users/srinath/playground/hadoop-book/hive-0.9.0/hdi-data.csv
    Loading data to table default.hdi
    OK
    Time taken: 1.447 seconds
    
  5. Now we can run a query on the defined table using Hive's SQL-like syntax (a sorting variation is sketched after the results that follow):
    hive> SELECT country, gni FROM HDI WHERE gni > 2000;
    

    If the command is successful, Hive prints the following job information and, finally, the results to the screen.

    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201206062316_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201206062316_0007
    Kill Command = /Users/srinath/playground/hadoop-book/hadoop-1.0.0/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201206062316_0007
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2012-06-07 20:45:32,638 Stage-1 map = 0%,  reduce = 0%
    2012-06-07 20:45:38,705 Stage-1 map = 100%,  reduce = 0%
    2012-06-07 20:45:44,751 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201206062316_0007
    MapReduce Jobs Launched:
    Job 0: Map: 1   HDFS Read: 9401 HDFS Write: 2435 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    

The final results will look like the following:

Norway     47557
Australia     34431
Netherlands     36402
United States     43017
New Zealand     23737
Canada     35166
...
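
This recipe's description also mentions sorting. As a minimal sketch using the column names defined in step 3, the same query can be extended with an ORDER BY clause to sort the matching countries by GNI in descending order:

hive> SELECT country, gni FROM HDI WHERE gni > 2000 ORDER BY gni DESC;

Note that ORDER BY in Hive produces a totally ordered result, which forces the query plan into a single reducer; for large datasets, SORT BY orders data within each reducer instead.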

How it works...

When we run Hive, we first define a table and load the data from a file into it. It is worth noting that the table definition must match the format of the input data file: the LOAD command copies the file into the table's storage location without any changes, and Hive then parses the file according to the table definition when the table is queried.
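
A quick way to observe this is to inspect the table's metadata, which includes its storage location. The following sketch uses the DESCRIBE EXTENDED command; the warehouse path in your installation may differ from the Hive default of /user/hive/warehouse:

hive> DESCRIBE EXTENDED HDI;

The location field in the output (for example, a URI ending in /user/hive/warehouse/hdi) is the directory into which the LOAD command in step 4 copied hdi-data.csv.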

Once the data is loaded, we can process it with Hive commands using SQL-like syntax. For example, the following command selects rows from the table that have a GNI value greater than 2000:

SELECT country, gni FROM HDI WHERE gni > 2000;
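
The same SQL-like syntax also supports aggregations. As a brief illustrative sketch beyond the recipe's query, the following computes the average HDI value across all countries; unlike the filter above, it launches a reduce stage because the aggregated values from the map tasks must be combined:

SELECT AVG(hdi) FROM HDI;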