Datafu is a Pig UDF library open sourced by the SNA team at LinkedIn. It contains many useful functions. This recipe will use play counts from the Audioscrobbler dataset and the Quantile UDF from datafu to identify and remove outliers.
Copy the datafu-0.0.4/dist/datafu-0.0.4.jar file to a location accessible by Pig.
Download the Audioscrobbler dataset from http://www.packtpub.com/support.
Register the datafu JAR file and define the Quantile UDF:
register /path/to/datafu-0.0.4.jar;
define Quantile datafu.pig.stats.Quantile('.90');
Load the user_artist_data.txt file:
plays = load '/data/audioscrobbler.txt' using PigStorage(' ') as (user_id:long, artist_id:long, playcount:long);
plays_grp = group plays ALL;
out_max = foreach plays_grp {
    ord = order plays by playcount;
    generate Quantile(ord.playcount) as ninetieth;
}
trim_outliers = foreach plays generate user_id, artist_id, (playcount>out_max.ninetieth ? out_max.ninetieth : playcount);
Store the user_artist_data.txt file with outliers trimmed:
store trim_outliers into '/data/audioscrobble/outliers_trimmed.bcp';
This recipe takes advantage of the datafu library open sourced by LinkedIn. Once a JAR file is registered, all of its UDFs are available to the Pig script. The define command calls the constructor of the datafu.pig.stats.Quantile UDF, passing it a value of .90. The constructor then creates an instance that produces the ninetieth percentile of the input vector it is passed. The define command also aliases Quantile as shorthand for referencing this UDF.
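The same define mechanism can configure one UDF instance for several quantiles at once. The following is a minimal sketch, assuming the registered datafu version accepts a list of quantile arguments in the constructor, as the datafu documentation describes. The alias name Quartiles is an illustrative choice and is not part of the recipe.

```pig
-- Sketch only: the alias Quartiles is a hypothetical name chosen for illustration.
register /path/to/datafu-0.0.4.jar;
define Quartiles datafu.pig.stats.Quantile('0.0', '0.25', '0.5', '0.75', '1.0');

plays = load '/data/audioscrobbler.txt' using PigStorage(' ')
        as (user_id:long, artist_id:long, playcount:long);
plays_grp = group plays ALL;

-- The bag is still sorted before it is handed to the UDF.
quartiles = foreach plays_grp {
    ord = order plays by playcount;
    generate Quartiles(ord.playcount);
}
dump quartiles;
```

Each define creates a separately configured instance, so a script can hold several aliases of the same UDF class with different constructor arguments.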
The user artist data is loaded into a relation named plays. This data is then grouped by ALL. The ALL group is a special kind of group that creates a single bag containing all of the input.
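To see what the ALL group produces, the grouped relation can be inspected directly. This is a small sketch that reuses the recipe's load statement. The relation name total_rows and the field name n are assumptions added for illustration.

```pig
plays = load '/data/audioscrobbler.txt' using PigStorage(' ')
        as (user_id:long, artist_id:long, playcount:long);
plays_grp = group plays ALL;

-- The schema should show two fields: group, and a bag named plays
-- that holds every input tuple.
describe plays_grp;

-- Because there is only one group, this yields a single row.
-- Note that COUNT skips tuples whose first field is null.
total_rows = foreach plays_grp generate COUNT(plays) as n;
dump total_rows;
```

Since everything lands in one bag, any aggregate applied to plays_grp is computed over the whole dataset rather than per key.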
The Quantile UDF requires that the data passed to it be sorted first. The data is sorted by play count, and the sorted vector of play counts is passed to the Quantile UDF. Sorting simplifies the job of the Quantile UDF: it picks the value at the ninetieth percentile position and returns it.
This value is then compared against each of the play counts in the user artist file. If the play count is greater, it is trimmed down to the value returned by the Quantile UDF; otherwise the value remains as it is.
The updated user artist file with outliers trimmed is then stored back in HDFS to be used for further processing.
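As a variation on the trimming step, rows above the threshold could be dropped instead of capped. The sketch below assumes that plays and out_max are defined exactly as in the recipe, and that out_max.ninetieth can be compared with playcount in the same way the recipe's trim step does. The relation name no_outliers and the output path are hypothetical.

```pig
-- Assumes plays and out_max already exist, as defined in the recipe.
-- Keep only rows whose play count does not exceed the ninetieth percentile.
no_outliers = filter plays by playcount <= out_max.ninetieth;

-- Hypothetical output path, chosen to sit beside the recipe's output.
store no_outliers into '/data/audioscrobble/outliers_removed.bcp';
```

Capping keeps every user and artist pair in the output, while filtering shrinks the dataset. Which one is appropriate depends on what the downstream processing expects.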
The datafu library also includes a StreamingQuantile UDF. This UDF is similar to the Quantile UDF except that it does not require the data to be sorted before it is used. Skipping the sort can greatly improve the performance of this operation. However, it does come at a cost: the StreamingQuantile UDF only provides an estimate of the values. To use it, change the define statement:
define Quantile datafu.pig.stats.StreamingQuantile('.90');
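With StreamingQuantile, the nested order step is no longer needed. The following is a minimal sketch of how the percentile step might look, assuming the same input file and schema as the recipe and that the registered datafu version ships StreamingQuantile. The result is an approximation of the ninetieth percentile, not the exact value.

```pig
register /path/to/datafu-0.0.4.jar;
define Quantile datafu.pig.stats.StreamingQuantile('.90');

plays = load '/data/audioscrobbler.txt' using PigStorage(' ')
        as (user_id:long, artist_id:long, playcount:long);
plays_grp = group plays ALL;

-- No nested order: the unsorted play counts are passed straight to the UDF.
out_max = foreach plays_grp generate Quantile(plays.playcount) as ninetieth;
dump out_max;
```

Comparing this estimate against the exact Quantile result on a sample of the data is a reasonable way to judge whether the approximation is acceptable.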