Sorting is a common data transformation. In this recipe, we will write a simple Pig script that sorts a dataset using the distributed processing power of a Hadoop cluster.
You will need to download/compile/install the following:
apache_nobots_tsv.txt from http://www.packtpub.com/support
Perform the following steps to sort data using Apache Pig:
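Load the web server logs into a Pig relation:
nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip:chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);
Order the web server log records by the timestamp field in ascending order:
ordered_weblogs = ORDER nobots_weblogs BY timestamp;
Store the sorted results in HDFS:
STORE ordered_weblogs INTO '/user/hadoop/ordered_weblogs';
Save the statements above as ordered_weblogs.pig and run the script:
$ pig -f ordered_weblogs.pig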
Sorting data in a distributed, shared-nothing environment is non-trivial, because each task sees only its own slice of the input. The Pig relational operator ORDER BY provides a total ordering of the dataset across all output files, not just within each one. Since our data was sorted by timestamp, every record in the output file part-00000 has a timestamp less than or equal to that of every record in the output file part-00001, and so on for each subsequent part file.
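To make the idea of a total ordering across part files concrete, here is a minimal Python sketch of one common way to achieve it: sample the sort keys, pick boundary values, route each record to a partition by key range, and sort each partition independently. This is a conceptual illustration only, not Pig's actual implementation. The function names choose_split_points and total_order_sort and the sample records are hypothetical.

```python
import bisect
import random


def choose_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Pick up to num_partitions - 1 boundary keys from a random sample.

    Hypothetical helper: a real system would sample in a separate job.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []  # no data: everything falls into partition 0
    # Evenly spaced values from the sorted sample become the boundaries.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def total_order_sort(records, key, num_partitions):
    """Return sorted partitions whose key ranges are in ascending order."""
    splits = choose_split_points([key(r) for r in records], num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Keys below the first boundary go to partition 0, keys at or
        # above the last boundary go to the final partition.
        index = bisect.bisect_right(splits, key(record))
        partitions[index].append(record)
    # Each partition is sorted on its own, as each reducer would do.
    return [sorted(p, key=key) for p in partitions]


if __name__ == "__main__":
    # Made-up (ip, timestamp) records standing in for the weblog data.
    rng = random.Random(1)
    logs = [("10.0.0.%d" % (t % 7), t) for t in rng.sample(range(1000), 50)]
    parts = total_order_sort(logs, key=lambda r: r[1], num_partitions=3)
    for number, part in enumerate(parts):
        timestamps = [r[1] for r in part]
        print("part-%05d" % number, len(part), "records",
              (timestamps[0], timestamps[-1]) if timestamps else None)
    # Reading the parts in order yields a globally sorted dataset.
    flat = [r for part in parts for r in part]
    assert flat == sorted(logs, key=lambda r: r[1])
```

Because the partitions are defined by key ranges rather than by hashing, concatenating them in order gives a fully sorted result without a final merge step. How evenly the records spread across partitions depends on how well the sample represents the data.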
The Pig ORDER BY relational operator can sort data by multiple fields, and also supports sorting data in descending order. For example, to sort the nobots_weblogs relation by the ip and timestamp fields, we would use the following expression:
ordered_weblogs = ORDER nobots_weblogs BY ip, timestamp;
To sort the nobots_weblogs relation by timestamp in descending order, use the DESC option:
ordered_weblogs = ORDER nobots_weblogs BY timestamp DESC;