Using Hive to intersect weblog IPs and determine the country

Hive does not directly support foreign keys. Nevertheless, it is still very common to join records on identically matching keys contained in one or more tables. This recipe will show a very simple inner join over weblog data that links each request record in the weblog_entries table to a country, based on the request IP.

For each record contained in the weblog_entries table, the query will print the record out with an additional trailing value showing the determined country.

Getting ready

Make sure that you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the weblog_entries dataset loaded into a Hive table named weblog_entries with the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe weblog_entries

You should see the following response:

OK
md5 string
url string
request_date string
request_time string
ip string

Additionally, this recipe requires that the ip-to-country dataset be loaded into a Hive table named ip_to_country with the following fields mapped to the respective datatypes.

Issue the following command to the Hive client:

describe ip_to_country

You should see the following response:

OK
Ip string
country string

How to do it...

Carry out the following steps to perform an inner join in HiveQL:

  1. Open a text editor of your choice, ideally one with SQL syntax highlighting.
  2. Add the following inline creation syntax:
    SELECT wle.*, itc.country FROM weblog_entries wle
        JOIN ip_to_country itc ON wle.ip = itc.ip;
  3. Save the script as weblog_simple_ip_join.hql in the active directory.
  4. Run the script from the operating system shell by supplying the –f option to the Hive client. You should see the results of the SELECT statement printed out to the console. The following snippet is a printout containing only two sample rows. The full printout will contain all 3000 rows:
    11402ba8f780f7fbfb108f213449e1b9  /u.html  2012-05-10  21:19:05  98.90.200.33 United States
    7ffb8f8ed136c5fe3a5dd6eedc32eae7  /cx.html  2012-05-10  21:17:05  59.19.27.24  Korea, Republic of

How it works...

The statement SELECT wle.* tells Hive to print every column for each record contained in the table weblog_entires, which is an alias for wle in shorthand.

Additionally, the JOIN operator tells Hive to perform a lookup in the ip_to_country table for each record, and find the specific country that maps to that weblog record's IP address. In other words, our join key is the IP address contained in both the tables.

There's more...

The following are a few more helpful introductory tips for the Hive JOIN syntax.

Hive supports multitable joins

A single SELECT statement can use multiple instances of the JOIN <table> ON syntax to match the conditions contained in multiple tables.

The ON operator for inner joins does not support inequality conditions

As of Hive 0.7.1, ON conditions cannot match records based on inequality.

The same query from the recipe will fail once the conditional operator is changed to inequality.

The following is the same query from the recipe, except that we wish to match every record for which the IP does not match an IP in the JOIN table:

SELECT wle.*, itc.country FROM weblog_entries wle
    JOIN ip_to_country itc ON wle.ip != itc.ip;

This query produces the following error:

FAILED: Error in semantic analysis: Line 2:30 Both left and right aliases encountered in JOIN ip

See also

This recipe is designed as a quick reference for simple table joins. More advanced Hive joins are covered in depth in the following recipes of Chapter 5, Advanced Joins:

  • Joining data in the Mapper using MapReduce
  • Joining data using Apache Pig replicated join
  • Join sorted data using Apache Pig merge join
  • Using a map-side join in Apache Hive to analyze geographical events
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset