Hive does not directly support foreign keys. Nevertheless, it is still very common to join records on identically matching keys contained in one or more tables.
This recipe will show a very simple inner join over weblog data that links each request record in the weblog_entries
table to a country, based on the request IP.
For each record contained in the weblog_entries
table, the query will print the record out with an additional trailing value showing the determined country.
Make sure that you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.
This recipe depends on having the weblog_entries
dataset loaded into a Hive table named weblog_entries
with the following fields mapped to the respective datatypes.
Issue the following command to the Hive client:
describe weblog_entries
You should see the following response:
OK md5 string url string request_date string request_time string ip string
Additionally, this recipe requires that the ip-to-country
dataset be loaded into a Hive table named ip_to_country
with the following fields mapped to the respective datatypes.
Issue the following command to the Hive client:
describe ip_to_country
You should see the following response:
OK Ip string country string
Carry out the following steps to perform an inner join in HiveQL:
SELECT wle.*, itc.country FROM weblog_entries wle JOIN ip_to_country itc ON wle.ip = itc.ip;
weblog_simple_ip_join.hql
in the active directory.–f
option to the Hive client. You should see the results of the SELECT
statement printed out to the console. The following snippet is a printout containing only two sample rows. The full printout will contain all 3000 rows:11402ba8f780f7fbfb108f213449e1b9 /u.html 2012-05-10 21:19:05 98.90.200.33 United States 7ffb8f8ed136c5fe3a5dd6eedc32eae7 /cx.html 2012-05-10 21:17:05 59.19.27.24 Korea, Republic of
The statement SELECT wle.*
tells Hive to print every column for each record contained in the table weblog_entires
, which is an alias for wle
in shorthand.
Additionally, the JOIN
operator tells Hive to perform a lookup in the ip_to_country
table for each record, and find the specific country that maps to that weblog record's IP address. In other words, our join key is the IP address contained in both the tables.
The following are a few more helpful introductory tips for the Hive JOIN
syntax.
A single SELECT
statement can use multiple instances of the JOIN <table> ON
syntax to match the conditions contained in multiple tables.
As of Hive 0.7.1, ON
conditions cannot match records based on inequality.
The same query from the recipe will fail once the conditional operator is changed to inequality.
The following is the same query from the recipe, except that we wish to match every record for which the IP does not match an IP in the
JOIN
table:
SELECT wle.*, itc.country FROM weblog_entries wle JOIN ip_to_country itc ON wle.ip != itc.ip;
This query produces the following error:
FAILED: Error in semantic analysis: Line 2:30 Both left and right aliases encountered in JOIN ip
This recipe is designed as a quick reference for simple table joins. More advanced Hive joins are covered in depth in the following recipes of Chapter 5, Advanced Joins: