In this recipe, we will use Python to create a simple Apache Pig user-defined function (UDF) to count the number of records in a Pig DataBag.
You will need to download the following dataset from http://www.packtpub.com/support:
apache_nobots_tsv.txt
This recipe also requires the Jython standalone JAR file. To build the file, download the Jython Java installer, run it with the following command, and select Standalone from the installation menu:
$ java -jar jython_installer-2.5.2.jar
Add the Jython standalone JAR file to Apache Pig's classpath:
$ export PIG_CLASSPATH=$PIG_CLASSPATH:/path/to/jython2.5.2/jython.jar
The following are the steps to create an Apache Pig UDF using Python:
1. Create a file named count.py with the following contents:

#!/usr/bin/python

@outputSchema("hits:long")
def calculate(inputBag):
    hits = len(inputBag)
    return hits

2. Create a Pig script that registers the UDF, groups the web log records by IP address and page, and counts the records in each group:

register 'count.py' using jython as count;

nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip: chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);

ip_page_groups = GROUP nobots_weblogs BY (ip, page);

ip_page_hits = FOREACH ip_page_groups GENERATE FLATTEN(group), count.calculate(nobots_weblogs);

STORE ip_page_hits INTO '/user/hadoop/ip_page_hits';
First, we created a simple Python function to calculate the length of a Pig DataBag. In addition, the Python script contained the Python decorator @outputSchema("hits:long"), which instructs Pig on how to interpret the data returned by the Python function. In this case, we want Pig to store the data returned by this function as a Java Long in a field named hits.
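To check the function's logic outside Pig, a minimal sketch in plain Python can help. It assumes that Pig's Jython bridge hands a DataBag to the function as a sequence of tuples. The outputSchema decorator defined below is a hypothetical stand-in, needed only because the real decorator is supplied by Pig when the script is registered, and the sample records are invented for illustration.

```python
# Hypothetical stand-in for the decorator that Pig supplies when the script
# is registered "using jython"; here it only records the schema string.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("hits:long")
def calculate(inputBag):
    # Assumption: the bag arrives as a sequence with one tuple per record.
    hits = len(inputBag)
    return hits


# Invented sample bag: three records belonging to one (ip, page) group.
sample_bag = [
    ("10.0.0.1", 1336258800, "/index.html", 200, 512, "Mozilla"),
    ("10.0.0.1", 1336258860, "/index.html", 200, 512, "Mozilla"),
    ("10.0.0.1", 1336258920, "/index.html", 304, 0, "Mozilla"),
]

print(calculate(sample_bag))
```

With the three sample records, the call returns 3, which is the value Pig would place in the hits field for that group.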
Next, we wrote a Pig script that registers the Python UDF using the statement:
register 'count.py' using jython as count;
Finally, we called the calculate() function through the alias count, passing it the nobots_weblogs DataBag of each group:
count.calculate(nobots_weblogs);
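The overall effect of the Pig script can be pictured with a rough plain-Python sketch that groups records by (ip, page) and applies the same length-based count to each group. This is only an illustration of the data flow, not how Pig executes the job. The helper name ip_page_hits and the sample records are hypothetical, and the sort is added purely to make the output deterministic, since Pig does not promise any output order here.

```python
from collections import defaultdict


def calculate(input_bag):
    # Same logic as the UDF: the number of records in the bag.
    return len(input_bag)


def ip_page_hits(records):
    # Mirrors GROUP nobots_weblogs BY (ip, page): one bag of records per key.
    groups = defaultdict(list)
    for record in records:
        ip, page = record[0], record[2]
        groups[(ip, page)].append(record)
    # Mirrors FOREACH ... GENERATE FLATTEN(group), count.calculate(bag).
    rows = [(ip, page, calculate(bag)) for (ip, page), bag in groups.items()]
    return sorted(rows)  # sorted only so the sketch prints in a stable order


# Invented records using the field order from the LOAD statement:
# (ip, timestamp, page, http_status, payload_size, useragent)
sample_records = [
    ("10.0.0.1", 1, "/a", 200, 10, "ua"),
    ("10.0.0.1", 2, "/a", 200, 10, "ua"),
    ("10.0.0.1", 3, "/b", 200, 10, "ua"),
    ("10.0.0.2", 4, "/a", 404, 0, "ua"),
]

for row in ip_page_hits(sample_records):
    print(row)
```

Each printed row corresponds to one line of the stored ip_page_hits relation: the IP address, the page, and the number of hits for that pair.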