Performing a join with Hive

This recipe will show how to use Hive to perform joins across two datasets.

The first dataset is the Human Development Report by country. HDR describes different countries based on several human development measures. You can find this dataset at http://hdr.undp.org/en/statistics/data/.

This recipe will use Hive to process the dataset and create a list of countries that has more than 2000$ of gross national income per capita, and then join them with export dataset.

Getting ready

This recipe assumes that the earlier recipe has been performed. Install Hive and follow the earlier recipe if you have not done so already.

How to do it...

This section demonstrates how to perform a join using Hive.

  1. From the sample directory, copy the resources/chapter5/export-data.csv to the HIVE_HOME directory.
  2. Start Hive by changing the directory to HIVE_HOME and running the following command:
    >bin/hive
    
  3. We will create a second table to join with the table we loaded in the earlier recipe.
    hive>  CREATE TABLE EXPO(country STRING, expct FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
    OK
    Time taken: 0.758 seconds
    
  4. We will load the data into the new table by running the following command with Hive. As explained in the earlier recipe, this will move the data to the storage location for the table and parse the data according to the table definition.
    hive> LOAD DATA LOCAL INPATH 'export-data.csv' INTO TABLE EXPO;
    
    Copying data from file:/Users/srinath/playground/hadoop-book/hive-0.9.0/export-data.csv
    Copying file: file:/Users/srinath/playground/hadoop-book/hive-0.9.0/export-data.csv
    Loading data to table default.expo
    OK
    Time taken: 0.391 seconds
    

    Now we can join the two tables using Hive's SQL-like join command.

    hive> SELECT h.country, gni, expct FROM HDI h JOIN EXPO e ON (h.country = e.country) WHERE gni> 2000;
    

    If successful it will print the following and print the results to the console:

    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    ...
    2012-06-07 21:19:04,978 Stage-1 map = 0%,  reduce = 0%
    2012-06-07 21:19:23,169 Stage-1 map = 50%,  reduce = 0%
    ..
    MapReduce Jobs Launched:
    Job 0: Map: 2  Reduce: 1   HDFS Read: 13809 HDFS Write: 2955 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    

    The final result would look like the following:

    Albania	7803	29.77231
    Algeria	7658	30.830406
    Andorra	36095	NULL
    Angola	4874	56.835884
    Antigua and Barbuda	15521	44.08267
    Argentina	14527	21.706469
    Armenia	5188	20.58361
    Australia	34431	19.780243
    Austria	35719	53.971355
    
    ...
    Time taken: 64.856 seconds
    

How it works...

When executed, Hive commands first define and load the second table and data. Then it converts the join command into MapReduce job and carries out the join by running the MapReduce job.

There's more...

Hive supports most SQL commands such as GROUP BY and ORDER BY, with the same semantics as SQL. You can find more details about Hive commands from https://cwiki.apache.org/confluence/display/Hive/Tutorial.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset