Using Apache Pig to sort web server log data by timestamp

Sorting is a common data transformation. In this recipe, we will write a simple Pig script that sorts a dataset using the distributed processing power of a Hadoop cluster.

Getting ready

You will need to download/compile/install the following:

Version 0.8.1 or later of Apache Pig from http://pig.apache.org/
Test data: apache_nobots_tsv.txt from http://www.packtpub.com/support

How to do it...

Perform the following steps to sort data using Apache Pig:

First load the web server log data into a Pig relation:

nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip: chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);

Next, order the web server log records by the timestamp field in ascending order:
```
ordered_weblogs = ORDER nobots_weblogs BY timestamp;
```

Finally, store the sorted results in HDFS:

STORE ordered_weblogs INTO '/user/hadoop/ordered_weblogs';

Save the preceding statements in a file named ordered_weblogs.pig, then run the Pig job:
```
$ pig -f ordered_weblogs.pig
```
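
Once the job finishes, it can be useful to confirm that the stored records really are in timestamp order. The following is a minimal sketch of such a check, not part of the original recipe. It assumes the output is tab-delimited (Pig's default storage format) and that the timestamp is the second column, as in the LOAD schema above. The function name `timestamps_in_order` is hypothetical.

```python
from typing import Iterable


def timestamps_in_order(lines: Iterable[str], field_index: int = 1) -> bool:
    """Return True if the timestamp column never decreases from one line to the next.

    Assumes tab-delimited records with the timestamp at field_index
    (index 1 matches the schema used in this recipe).
    """
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        timestamp = int(line.split("\t")[field_index])
        if previous is not None and timestamp < previous:
            return False  # found a record that is out of order
        previous = timestamp
    return True
```

The lines passed in should be the contents of the part files under /user/hadoop/ordered_weblogs, read in file name order.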

How it works...

Sorting data in a distributed, shared-nothing environment is non-trivial. The Pig relational operator ORDER BY provides a total ordering of a dataset. This means that every record in the output file part-00000 has a timestamp less than or equal to that of every record in the output file part-00001 (since our data was sorted by timestamp).
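
One common way to obtain a total ordering across several output files is to sample the sort keys, pick split points from the sample, send each record to the partition whose key range contains it, and then sort each partition independently. The following sketch illustrates that idea in plain Python. It is a simplified model under stated assumptions and not Pig's actual implementation; the function names `choose_boundaries` and `total_order_sort` are hypothetical.

```python
import bisect
import random
from typing import List, Sequence


def choose_boundaries(sample: Sequence[int], num_partitions: int) -> List[int]:
    """Pick num_partitions - 1 split points from a sample of the sort keys."""
    ordered = sorted(sample)
    if not ordered:
        return []  # no data: every key falls into partition 0
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]


def total_order_sort(keys: Sequence[int], num_partitions: int,
                     sample_size: int = 100, seed: int = 0) -> List[List[int]]:
    """Range-partition keys, then sort each partition on its own.

    Each inner list stands in for one output file (part-00000, part-00001, ...).
    """
    rng = random.Random(seed)
    sample = rng.sample(list(keys), min(sample_size, len(keys)))
    boundaries = choose_boundaries(sample, num_partitions)
    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    for key in keys:
        # Keys below the first boundary go to partition 0, and so on upward.
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    # Sorting within each partition is enough: the ranges do not overlap.
    return [sorted(partition) for partition in partitions]
```

Because the key ranges do not overlap, concatenating the partitions in order yields the fully sorted dataset, which is the property described above for part-00000 and part-00001.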

There's more...

The Pig ORDER BY relational operator can sort data by multiple fields, and also supports sorting in descending order. For example, to sort the nobots_weblogs relation by the ip and timestamp fields, we would use the following expression:

ordered_weblogs = ORDER nobots_weblogs BY ip, timestamp;

To sort the nobots_weblogs relation by timestamp in descending order, use the DESC option:

ordered_weblogs = ORDER nobots_weblogs BY timestamp DESC;
