This recipe will take a list of Nigerian VIPs and join any Nigerian ACLED events that occurred on any VIP's birthday. We are not only interested in viewing events that did or did not occur on a famous person's birthday, but also in the people who are not linked to any event. To accomplish this analytics in a single query, a full outer join makes the most sense. We would also like to store the results in a table.
Ensure that Apache Hive 0.7.1 is installed on your client machine and on the environment path for the active user account.
This recipe depends on having the Nigera_ACLED_cleaned.tsv
dataset loaded into a Hive table with the name acled_nigeria_cleaned
and with the following fields mapped to the respective datatypes. The
Nigera_ACLED_cleaned.tsv
dataset can be downloaded from http://www.packtpub.com/support.
Issue the following command to the Hive client:
describe acled_nigeria_cleaned
You should see the following response:
OK loc string event_date string event_type string actor string latitude double longitude double source string fatalities int
This recipe also requires having nigeria-vip-birthdays.tsv
loaded into a Hive table with the name nigeria_vips
and with the following fields mapped to the respective datatypes. The nigeria-vip-birthdays.tsv
dataset can be downloaded from http://www.packtpub.com/support.
Issue the following command to the Hive client:
describe nigeria_vips
You should see the following response:
OK name string birthday string description string
Follow the steps to perform a full outer join in Hive:
full_outer_join_acled_vips.sql
.DROP TABLE IF EXISTS acled_nigeria_event_people_links; CREATE TABLE acled_nigeria_event_people_links AS SELECT acled.event_date, acled.event_type, vips.name, vips.description as pers_desc, vips.birthday FROM nigeria_vips vips FULL OUTER JOIN acled_nigeria_cleaned acled ON (substr(acled.event_date,6) = substr(vips.birthday, 6));
full_outer_join_acled_vips.sql
script from the operating system shell by supplying the –f
option to the Hive client.acled_nigeria_event_people_links
.SELECT * FROM acled_nigeria_event_people_links WHERE event_date IS NOT NULL AND birthday IS NOT NULL limit 2";
OK 2008-01-01 Battle-No change of territory Jaja Wachuku "First speaker of the Nigerian House of Representatives" 1918-01-01 2002-01-01 Riots/Protests Jaja Wachuku "First speaker of the Nigerian House of Representatives" 1918-01-01
First, we drop any tables previously created by the name acled_nigeria_event_people_links
. We use an inline CREATE TABLE AS
statement to shortcut having to explicitly define the table.
The full outer join will match rows from acled_nigeria_cleaned
to rows from nigeria_vips
, where the substring of the records in the event_date
column, starting at the sixth character position, is equal to a VIP's birthday. We use the substr(event_date, 6)
method to eliminate the year portion of the records in the event_date
column as a comparison factor.
The columns our receiving table will contain
from the SELECT
statement are acled.event_date
, acled.event_type
, vips.name
, vips.description as pers_desc
, and vips.birthday
. The vips.description
column gets the alias pers_desc
to make the column label a little more meaningful. For event records with no matching birthdays, the columns vips.name
, vips.description
, and pers_desc
will be NULL
. For people with no matching events, the columns acled.event_date
and acled.event_type
will be NULL
.
The decision to include nigeria_vips
in the FROM
and JOIN
clauses on acled_nigeria_cleaned
was made to optimize the reducer throughput. Since we are performing a Hive common join and not a map-side join, the actual table joining will occur reduce-side. Hive will attempt to buffer the rows from the left-most table and then stream the rows from the right-most table. The table nigeria_vips
is much smaller than acled_nigeria_cleaned
, thus we can alleviate the reducer memory footprint by designing the syntax of our query to stream the rows of acled_nigeria_cleaned
and buffer those in nigeria_vips
.
As it turns out for this particular VIP list, there were no birthdays for which there was no event listed in acled_nigeria_cleaned
; therefore, the outer join produces no rows for people whose birthdays did not match an event. Moreover, no two people in our list have the same birthday; therefore, the outer join does not replicate the same event into multiple rows for each VIP birthday joined. The resulting table contains 2931 rows, which is exactly the same as the number of rows in acled_nigeria_cleaned
.
There are other things you can do to improve the performance of the join operations in Hive.
The Hive documentation will use the term "common join" to refer to a join operation where one or more reducers are required to physically join the table rows. Map-side joins, as the name would suggest, perform the join across parallel map tasks and eliminate the need for the reduce phase.
You can dictate which tables to stream during the reduce phase by using /*+ STREAMTABLE(tablename) */
.
The left-to-right ordering of join table declarations in your queries, especially in a multi-table join, is very important. Hive will attempt to buffer the rows from the left-hand side table and stream the results of the right-hand side. In a multi-table join, several map/reduce jobs may occur, but the same semantics apply. The result of the first join will be buffered, while the rows of the next right-most table will be streamed. Use this knowledge to order your table joins wisely.