This recipe will show a very simple analytic that uses Hive to count fatalities for every month appearing in the dataset and print the results to the console.
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.
This recipe depends on having the
Nigera_ACLED_cleaned.tsv
dataset loaded into a Hive table named acled_nigeria_cleaned
with the following fields mapped to the respective datatypes.
Issue the following command to the Hive client:
describe acled_nigeria_cleaned
You should see the following response:
OK loc string event_date string event_type string actor string latitude double longitude double source string fatalities int
Follow the steps to use Hive for report generation:
SELECT from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'yyyy-MMM'), COALESCE(CAST(sum(fatalities) AS STRING), 'Unknown') FROM acled_nigeria_cleaned GROUP BY from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'),'yyyy-MMM'),
monthly_violence_totals.sql
in the active folder.–f
option to the Hive client. You should see the following three rows appear first in the output console. Note that the output is sorted lexicographically, and not on the order of dates.OK 1997-Apr 115 1997-Aug 4 1997-Dec 26
The SELECT
statement uses
unix_timestamp()
and from_unixtime()
to reformat the event_date
for each row as just a year-month concatenated field.
This is also in the GROUP BY
expression for totaling fatalities using sum()
.
The
coalesce()
method returns the first non-null argument passed to it. We pass as the first argument, the value of fatalities summed for that given year-month, cast as a string. If that value is NULL
for any reason, return the constant Unknown
. Otherwise return the string representing the total fatalities counted for that year-month combination. Print everything to the console over stdout
.
The following are some additional helpful tips related to the code in this recipe:
As mentioned in the Hive documentation,
coalesce()
supports one or more arguments. The first non-null argument will be returned. This can be useful for evaluating several different expressions for a given column before deciding the right one to choose.
The coalesce()
will return NULL
if no argument is non-null. It's not uncommon to provide a type literal to return if all other arguments are NULL
.