This recipe explains how you can use Hive to perform data processing operations using its SQL-like query language.
In this recipe, we will use a dataset that contains Human Development Report (HDR) data by country. The HDR describes different countries based on several human development measures. You can find the dataset at http://hdr.undp.org/en/statistics/data/.
For this recipe, you need a working Hive installation. If you have not done so already, please follow the previous recipe to install Hive.
This section describes how to use Hive for filtering and sorting.
1. Copy the resources/chapter5/hdi-data.csv file to the HIVE_HOME directory.
2. Start Hive by changing the directory to HIVE_HOME and running the following command:
>bin/hive
3. Create a table to hold the HDI data by running the following command:
hive> CREATE TABLE HDI(id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch INT, gni INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
OK
Time taken: 11.719 seconds
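If you want to confirm that the table was created with the expected schema, Hive's standard DESCRIBE command lists the columns and their types (an optional check, not part of the original recipe):
hive> DESCRIBE HDI;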
4. Use the LOAD command to load the data into the table. It is worth noting that the LOAD command copies the file, without any alteration, to the storage location of the table as defined by the table definition. Hive then uses the formats defined in the table definition to parse the data and load it into the table. For example, the table definition in step 3 defines a table HDI that stores its data as a text file with fields terminated by ',' (the CSV format). The input we provide to the LOAD command must follow the CSV format, as per the table definition.
hive> LOAD DATA LOCAL INPATH 'hdi-data.csv' INTO TABLE HDI;
Copying data from file:/Users/srinath/playground/hadoop-book/hive-0.9.0/hdi-data.csv
Copying file: file:/Users/srinath/playground/hadoop-book/hive-0.9.0/hdi-data.csv
Loading data to table default.hdi
OK
Time taken: 1.447 seconds
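As a quick sanity check (an optional step, not in the original recipe), you can sample a few rows to verify that the data was parsed correctly; LIMIT is standard HiveQL:
hive> SELECT * FROM HDI LIMIT 5;
If the columns come back as NULL values, the input file format most likely does not match the table definition.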
5. Run the following query to select the countries that have a GNI value greater than 2000:
hive> SELECT country, gni FROM HDI WHERE gni > 2000;
If the query is successful, Hive will print the following information and, finally, print the results to the screen:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201206062316_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201206062316_0007
Kill Command = /Users/srinath/playground/hadoop-book/hadoop-1.0.0/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201206062316_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-06-07 20:45:32,638 Stage-1 map = 0%, reduce = 0%
2012-06-07 20:45:38,705 Stage-1 map = 100%, reduce = 0%
2012-06-07 20:45:44,751 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206062316_0007
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 9401 HDFS Write: 2435 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
The final results will look like the following:
Norway          47557
Australia       34431
Netherlands     36402
United States   43017
New Zealand     23737
Canada          35166
...
When we run this recipe, we first define a table and load the data from a file into that table. It is worth noting that the table definition must match the format of the input data file; the LOAD command copies the file into the table's storage location without any changes and then tries to parse the file according to the table definition.
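As an aside, the LOCAL keyword in the LOAD command above tells Hive to copy the file from the local filesystem; without it, Hive moves a file that already resides in HDFS, and adding OVERWRITE replaces the table's existing contents. Both options are standard HiveQL; the HDFS path below is a hypothetical example:
hive> LOAD DATA INPATH '/user/srinath/hdi-data.csv' OVERWRITE INTO TABLE HDI;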
Once the data is loaded, we can use Hive commands to process it using a SQL-like syntax. For example, the following command selects the rows from the table that have a GNI value greater than 2000:
SELECT country, gni FROM HDI WHERE gni > 2000;
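This recipe's introduction also mentions sorting. Although the output shown above covers only filtering, the same query can be extended with an ORDER BY clause, which is standard HiveQL; Hive runs ORDER BY through a single reducer to produce a totally ordered result:
SELECT country, gni FROM HDI WHERE gni > 2000 ORDER BY gni DESC;
For large datasets, SORT BY (which orders the data only within each reducer) scales better when a total ordering is not required.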