This recipe will show how to use Hive to perform joins across two datasets.
The first dataset is the Human Development Report by country. HDR describes different countries based on several human development measures. You can find this dataset at http://hdr.undp.org/en/statistics/data/.
This recipe will use Hive to process the dataset and create a list of countries that has more than 2000$ of gross national income per capita, and then join them with export dataset.
This recipe assumes that the earlier recipe has been performed. Install Hive and follow the earlier recipe if you have not done so already.
This section demonstrates how to perform a join using Hive.
resources/chapter5/export-data.csv
to the HIVE_HOME
directory.HIVE_HOME
and running the following command:>bin/hive
hive> CREATE TABLE EXPO(country STRING, expct FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 0.758 seconds
hive> LOAD DATA LOCAL INPATH 'export-data.csv' INTO TABLE EXPO; Copying data from file:/Users/srinath/playground/hadoop-book/hive-0.9.0/export-data.csv Copying file: file:/Users/srinath/playground/hadoop-book/hive-0.9.0/export-data.csv Loading data to table default.expo OK Time taken: 0.391 seconds
Now we can join the two tables using Hive's SQL-like join command.
hive> SELECT h.country, gni, expct FROM HDI h JOIN EXPO e ON (h.country = e.country) WHERE gni> 2000;
If successful it will print the following and print the results to the console:
Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): ... 2012-06-07 21:19:04,978 Stage-1 map = 0%, reduce = 0% 2012-06-07 21:19:23,169 Stage-1 map = 50%, reduce = 0% .. MapReduce Jobs Launched: Job 0: Map: 2 Reduce: 1 HDFS Read: 13809 HDFS Write: 2955 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK
The final result would look like the following:
Albania 7803 29.77231 Algeria 7658 30.830406 Andorra 36095 NULL Angola 4874 56.835884 Antigua and Barbuda 15521 44.08267 Argentina 14527 21.706469 Armenia 5188 20.58361 Australia 34431 19.780243 Austria 35719 53.971355 ... Time taken: 64.856 seconds
When executed, Hive commands first define and load the second table and data. Then it converts the join command into MapReduce job and carries out the join by running the MapReduce job.
Hive supports most SQL commands such as GROUP BY
and ORDER BY
, with the same semantics as SQL. You can find more details about Hive commands from https://cwiki.apache.org/confluence/display/Hive/Tutorial.