When joining two tables in Apache Hive, one table might be significantly smaller than the other. In such cases, Hive can push a hash table representing the smaller table over the distributed cache and join the tables entirely map-side, which can lead to better parallelism and job throughput. In this recipe, we will use a map-side join to attach any significant holiday information that may have occurred on a particular geographic event.
Ensure that Apache Hive 0.7.1 is installed on your client machine and on the environment path for the active user account.
This recipe depends on having the Nigera_ACLED_cleaned.tsv
dataset loaded into a Hive table with the name acled_nigeria_cleaned
and with the following fields mapped to the respective datatypes. The Nigera_ACLED_cleaned.tsv
dataset can be downloaded from http://www.packtpub.com/support.
Issue the following command to the Hive client:
describe acled_nigeria_cleaned
You should see the following response:
OK loc string event_date string event_type string actor string latitude double longitude double source string fatalities int
This recipe also requires having nigeria-holidays.tsv
loaded into a Hive table with the name nigeria_holidays
and the following fields mapped to the respective datatypes.
Issue the following command to the Hive client:
describe nigeria_holidays
You should see the following response:
OK yearly_date string description string
Carry out the following steps to perform a map-side join in Apache Hive:
map-join-acled-holidays.sql
.SELECT /*+ MAPJOIN(nh)*/ acled.event_date, acled.event_type, nh.description FROM acled_nigeria_cleaned acled JOIN nigeria_holidays nh ON (substr(acled.event_date, 6) = nh.yearly_date);
map-join-acled-holidays.sql
script from the operating system shell by supplying the –f
option to the Hive client. You will know the map-side join is working if you see this message in the output trace:Mapred Local Task Succeeded. Convert the Join into MapJoin
The generated MapReduce job should not have any reduce tasks.
2002-01-01 Riots/Protests New Years Day 2001-06-12 Battle-No change of territory Lagos State only; in memory of failed 1993 election 2002-05-29 Violence against civilians Democracy Day 2010-10-01 Riots/Protests Independence Day 2010-10-01 Violence against civilians Independence Day
The script inner joins the month-day portion of each record in the event_date
column in acled_nigeria_cleaned
to the yearly_date
column in nigeria_holidays
. substr(event_date, 6)
will omit the year portion from each record in the event_date
column by starting from the position of the sixth character. The inline hint to /*+ MAPJOIN(nh) */
lets you manually define which table alias to load as the small table to each mapper. The
nigeria_holidays
table is very small and made the most sense to load as a hash table. Each map process in the join can operate over rows from acled_nigeria_cleaned
with its own copy of the nigeria_holidays
hash table. The MAPJOIN
operation handles creating the hash table and distributing it to each map task.
We would like to see the values of the event_date
and event_type
columns, and a description of the holiday, if any, the event occurred
on.
Map-side joins can be tricky to configure and use properly. Here are a few pointers.
Set the property hive.auto.convert.join
to true
in your Hive config and Hive will automatically try to convert the join to a map-side join, as long as the table fits below a certain size threshold. You can configure the maximum size with the property hive.smalltable.filesize
. This will tell Hive what file size (or below) constitutes a small table. It's written in bytes expressed as a long (for example, 25000000L = 25M).
Also consider setting hive.hashtable.max.memory.usage
, which tells the map task to terminate if it requires more than the configured memory percentage.