Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Performing a join with Hive

This recipe will show how to use Hive to perform joins across two datasets.

The first dataset is the Human Development Report by country. HDR describes different countries based on several human development measures. You can find this dataset at http://hdr.undp.org/en/statistics/data/.

This recipe will use Hive to process the dataset and create a list of countries that has more than 2000$ of gross national income per capita, and then join them with export dataset.

Getting ready

This recipe assumes that the earlier recipe has been performed. Install Hive and follow the earlier recipe if you have not done so already.

How to do it...

This section demonstrates how to perform a join using Hive.

From the sample directory, copy the resources/chapter5/export-data.csv to the HIVE_HOME directory.
Start Hive by changing the directory to HIVE_HOME and running the following command:
```
>bin/hive
```

We will create a second table to join with the table we loaded in the earlier recipe.

hive>  CREATE TABLE EXPO(country STRING, expct FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.758 seconds

We will load the data into the new table by running the following command with Hive. As explained in the earlier recipe, this will move the data to the storage location for the table and parse the data according to the table definition.

hive> LOAD DATA LOCAL INPATH 'export-data.csv' INTO TABLE EXPO;

Copying data from file:/Users/srinath/playground/hadoop-book/hive-0.9.0/export-data.csv
Copying file: file:/Users/srinath/playground/hadoop-book/hive-0.9.0/export-data.csv
Loading data to table default.expo
OK
Time taken: 0.391 seconds

Now we can join the two tables using Hive's SQL-like join command.

hive> SELECT h.country, gni, expct FROM HDI h JOIN EXPO e ON (h.country = e.country) WHERE gni> 2000;

If successful it will print the following and print the results to the console:

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
...
2012-06-07 21:19:04,978 Stage-1 map = 0%,  reduce = 0%
2012-06-07 21:19:23,169 Stage-1 map = 50%,  reduce = 0%
..
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   HDFS Read: 13809 HDFS Write: 2955 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK

The final result would look like the following:

Albania	7803	29.77231
Algeria	7658	30.830406
Andorra	36095	NULL
Angola	4874	56.835884
Antigua and Barbuda	15521	44.08267
Argentina	14527	21.706469
Armenia	5188	20.58361
Australia	34431	19.780243
Austria	35719	53.971355

...
Time taken: 64.856 seconds

How it works...

When executed, Hive commands first define and load the second table and data. Then it converts the join command into MapReduce job and carries out the join by running the MapReduce job.

There's more...

Hive supports most SQL commands such as GROUP BY and ORDER BY, with the same semantics as SQL. You can find more details about Hive commands from https://cwiki.apache.org/confluence/display/Hive/Tutorial.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Performing a join with Hive

Create new playlist

Sign In

Sign Up

Performing a join with Hive

Getting ready

How to do it...

How it works...

There's more...

Table of Contents for
Performing a join with Hive