This recipe will illustrate the efficient use of the Hive date UDFs to list the 20 most recent events and the number of days between the event date and the current system date.
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.
This recipe depends on having the Nigera_ACLED_cleaned.tsv
dataset loaded into a Hive table named acled_nigeria_cleaned
with the fields mapped to the respective datatypes.
Issue the following command to the Hive client to see the mentioned fields:
describe acled_nigeria_cleaned
You should see the following response:
OK Loc string event_date string event_type string actor string latitude double longitude double source string fatalities int
Perform the following steps to utilize Hive UDFs for sorting and transformation:
SELECT event_type,event_date,days_since FROM ( SELECT event_type,event_date, datediff(to_date(from_unixtime(unix_timestamp())), to_date(from_unixtime( unix_timestamp(event_date, 'yyyy-MM-dd')))) AS days_since FROM acled_nigeria_cleaned) date_differences ORDER BY event_date DESC LIMIT 20;
top_20_recent_events.sql
in the active folder.–f
option to the Hive client. You should see the following five rows appear first in the output console:OK Battle-No change of territory 2011-12-31 190 Violence against civilians 2011-12-27 194 Violence against civilians 2011-12-25 196 Violence against civilians 2011-12-25 196 Violence against civilians 2011-12-25 196
Let's start with the nested SELECT
subqueries. We select three fields from our Hive table acled_nigeria_cleaned
: event_type
, event_date
, and the result of calling the UDF datediff()
, which takes as arguments an end date and a start date. Both are expected in the form yyyy-MM-dd. The first argument to datediff()
is the end date, with which we want to represent the current system date. Calling unix_timestamp()
with no arguments will return the current system time in milliseconds. We send that return value to from_unixtimestamp()
to get a formatted timestamp representing the current system date in the default Java 1.6 format (yyyy-MM-dd HH:mm:ss). We only care about the date portion, so calling to_date()
with the output of this function strips the HH:mm:ss. The result is the current date in the yyyy-MM-dd form.
The second argument to
datediff()
is the start date, which for our query is the event_date
. The series of function calls operate in almost the exact same manner as our previous argument, except that when we call unix_timestamp()
, we must tell the function that our argument is in the SimpleDateFormat
format that is yyyy-MM-dd. Now we have both start_date
and end_date
arguments in the yyyy-MM-dd format and can perform the datediff()
operation for the given row. We alias the output column of datediff()
as days_since
for each row.
The outer SELECT
statement takes these three columns per row and sorts the entire output by event_date
in descending order to get reverse chronological ordering. We arbitrarily limit the output to only the first 20.
The net result is the 20 most recent events with the number of days that have passed since that event occurred.