Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Using optimized full outer joins in Apache Hive to analyze geographical events

This recipe will take a list of Nigerian VIPs and join any Nigerian ACLED events that occurred on any VIP's birthday. We are not only interested in viewing events that did or did not occur on a famous person's birthday, but also in the people who are not linked to any event. To accomplish this analytics in a single query, a full outer join makes the most sense. We would also like to store the results in a table.

Getting ready

Ensure that Apache Hive 0.7.1 is installed on your client machine and on the environment path for the active user account.

This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table with the name acled_nigeria_cleaned and with the following fields mapped to the respective datatypes. The Nigera_ACLED_cleaned.tsv dataset can be downloaded from http://www.packtpub.com/support.

Issue the following command to the Hive client:

describe acled_nigeria_cleaned

You should see the following response:

OK
loc string
event_date  string
event_type string
actor  string
latitude  double
longitude  double
source  string
fatalities  int

This recipe also requires having nigeria-vip-birthdays.tsv loaded into a Hive table with the name nigeria_vips and with the following fields mapped to the respective datatypes. The nigeria-vip-birthdays.tsv dataset can be downloaded from http://www.packtpub.com/support.

Issue the following command to the Hive client:

describe nigeria_vips

You should see the following response:

OK
name  string
birthday string
description string

How to do it...

Follow the steps to perform a full outer join in Hive:

Open a text editor and create a file named full_outer_join_acled_vips.sql.

Add the inline creation and transformation syntax:

DROP TABLE IF EXISTS acled_nigeria_event_people_links;
CREATE TABLE acled_nigeria_event_people_links AS
SELECT acled.event_date, acled.event_type, vips.name, vips.description as pers_desc, vips.birthday
    FROM nigeria_vips vips
    FULL OUTER JOIN acled_nigeria_cleaned acled
        ON (substr(acled.event_date,6) = substr(vips.birthday, 6));

Run the full_outer_join_acled_vips.sql script from the operating system shell by supplying the –f option to the Hive client.
Once the script successfully finishes, it should signal 2931 records loaded into the table acled_nigeria_event_people_links.

Issue the following query to the Hive shell:

SELECT * FROM acled_nigeria_event_people_links WHERE event_date IS NOT NULL AND birthday IS NOT NULL limit 2";

You should see the following output:

OK
2008-01-01  Battle-No change of territory  Jaja   Wachuku "First speaker of the Nigerian House of Representatives"  1918-01-01

2002-01-01  Riots/Protests  Jaja Wachuku  "First speaker of the Nigerian House of Representatives"  1918-01-01

How it works...

First, we drop any tables previously created by the name acled_nigeria_event_people_links. We use an inline CREATE TABLE AS statement to shortcut having to explicitly define the table.

The full outer join will match rows from acled_nigeria_cleaned to rows from nigeria_vips, where the substring of the records in the event_date column, starting at the sixth character position, is equal to a VIP's birthday. We use the substr(event_date, 6) method to eliminate the year portion of the records in the event_date column as a comparison factor.

The columns our receiving table will contain from the SELECT statement are acled.event_date, acled.event_type, vips.name, vips.description as pers_desc, and vips.birthday. The vips.description column gets the alias pers_desc to make the column label a little more meaningful. For event records with no matching birthdays, the columns vips.name, vips.description, and pers_desc will be NULL. For people with no matching events, the columns acled.event_date and acled.event_type will be NULL.

The decision to include nigeria_vips in the FROM and JOIN clauses on acled_nigeria_cleaned was made to optimize the reducer throughput. Since we are performing a Hive common join and not a map-side join, the actual table joining will occur reduce-side. Hive will attempt to buffer the rows from the left-most table and then stream the rows from the right-most table. The table nigeria_vips is much smaller than acled_nigeria_cleaned, thus we can alleviate the reducer memory footprint by designing the syntax of our query to stream the rows of acled_nigeria_cleaned and buffer those in nigeria_vips.

As it turns out for this particular VIP list, there were no birthdays for which there was no event listed in acled_nigeria_cleaned; therefore, the outer join produces no rows for people whose birthdays did not match an event. Moreover, no two people in our list have the same birthday; therefore, the outer join does not replicate the same event into multiple rows for each VIP birthday joined. The resulting table contains 2931 rows, which is exactly the same as the number of rows in acled_nigeria_cleaned.

There's more...

There are other things you can do to improve the performance of the join operations in Hive.

Common join versus map-side join

The Hive documentation will use the term "common join" to refer to a join operation where one or more reducers are required to physically join the table rows. Map-side joins, as the name would suggest, perform the join across parallel map tasks and eliminate the need for the reduce phase.

STREAMTABLE hint

You can dictate which tables to stream during the reduce phase by using /*+ STREAMTABLE(tablename) */.

Table ordering in the query matters

The left-to-right ordering of join table declarations in your queries, especially in a multi-table join, is very important. Hive will attempt to buffer the rows from the left-hand side table and stream the results of the right-hand side. In a multi-table join, several map/reduce jobs may occur, but the same semantics apply. The result of the first join will be buffered, while the rows of the next right-most table will be streamed. Use this knowledge to order your table joins wisely.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Using optimized full outer joins in Apache Hive to analyze geographical events

Create new playlist

Sign In

Sign Up