This recipe explains how to carry out join and sort operations with Pig.
This sample will use two datasets. The first dataset has the Gross National Income (GNI) per capita by country, and the second dataset has the exports of the country as a percentage of its gross domestic product.
This recipe uses Pig to process the datasets: it creates a list of countries whose gross national income per capita exceeds $2,000, sorts that list by the GNI value, and then joins it with the export dataset.
This recipe needs a working Pig installation. If you have not done it already, follow the earlier recipe and install Pig.
This section will describe how to use Pig to join two datasets.
Change the directory to PIG_HOME.
Copy resources/chapter5/hdi-data.csv and resources/chapter5/export-data.csv to PIG_HOME/bin.
Copy the resources/chapter5/countryJoin.pig script to PIG_HOME/bin.
Open countryJoin.pig with your favorite editor. The script joins the HDI data and the export data together. Pig calls its scripts "Pig Latin scripts".

    A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;
    C = ORDER B BY gni;
    D = load 'export-data.csv' using PigStorage(',') AS (country:chararray, expct:float);
    E = JOIN C BY country, D by country;
    dump E;
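The first and fourth lines load the data from the CSV files. As described in the earlier recipe, PigStorage(',') tells Pig to use ',' as the field separator and assigns the values to the fields listed in the AS clause of the command.
The second line keeps only the countries whose GNI per capita exceeds 2000, and the third line sorts them by GNI.
Then the fifth line joins the two datasets together on the country name, and the last line prints the result.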
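The data flow of countryJoin.pig can be mirrored in plain Python to make each statement concrete. The following is only a sketch of the semantics, not how Pig executes the script. The sample rows are copied from the output shown in this recipe, except for the "Lowland" row, which is made up so that the filter has something to drop. The function names are hypothetical.

```python
import csv
import io

# Rows copied from the sample output in this recipe, plus one made-up
# low-income row ("Lowland") so that the filter removes something.
HDI_CSV = """51,Cuba,0.776,79,9,17,5416
100,Fiji,0.688,69,10,13,4145
134,India,0.547,65,4,10,3468
999,Lowland,0.300,50,2,5,1500
"""
EXPORT_CSV = """Cuba,19.613546
Fiji,52.537148
India,21.537624
"""


def load_hdi(text):
    # Mirrors: load ... AS (id:int, country:chararray, hdi:float, ..., gni:int)
    return [
        (int(r[0]), r[1], float(r[2]), int(r[3]), int(r[4]), int(r[5]), int(r[6]))
        for r in csv.reader(io.StringIO(text))
    ]


def load_exports(text):
    # Mirrors: load ... AS (country:chararray, expct:float)
    return [(r[0], float(r[1])) for r in csv.reader(io.StringIO(text))]


def country_join(hdi_text, export_text):
    a = load_hdi(hdi_text)
    b = [row for row in a if row[6] > 2000]   # FILTER A BY gni > 2000
    c = sorted(b, key=lambda row: row[6])     # ORDER B BY gni
    d = load_exports(export_text)
    # JOIN C BY country, D BY country: inner join, fields of both sides
    # are concatenated into one output tuple.
    return [left + right for left in c for right in d if left[1] == right[0]]


if __name__ == "__main__":
    for row in country_join(HDI_CSV, EXPORT_CSV):
        print(row)
```

This sketch happens to keep the GNI ordering after the join. Pig does not promise that tuples stay in ORDER order once a later operator such as JOIN processes them, which may be why the sample output in this recipe is not sorted by GNI.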
Run the Pig script with the following command from the PIG_HOME directory.

    > bin/pig -x local bin/countryJoin.pig
    (51,Cuba,0.776,79,9,17,5416,Cuba,19.613546)
    (100,Fiji,0.688,69,10,13,4145,Fiji,52.537148)
    (132,Iraq,0.573,69,5,9,3177,Iraq,)
    (89,Oman,0.705,73,5,11,22841,Oman,)
    (80,Peru,0.725,74,8,12,8389,Peru,25.108027)
    (44,Chile,0.805,79,9,14,13329,Chile,38.71985)
    (101,China,0.687,73,7,11,7476,China,29.571701)
    (106,Gabon,0.674,62,7,13,12249,Gabon,61.610462)
    (134,India,0.547,65,4,10,3468,India,21.537624)
    ...
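Some tuples in the output, such as those for Iraq and Oman, end with an empty field, which most likely means that the export value is missing for that country. A small Python sketch can pick out such countries from the dumped tuples. It assumes that no field contains a comma or a parenthesis, and the helper names are hypothetical.

```python
# Lines copied from the sample output shown in this recipe.
DUMP_LINES = [
    "(51,Cuba,0.776,79,9,17,5416,Cuba,19.613546)",
    "(100,Fiji,0.688,69,10,13,4145,Fiji,52.537148)",
    "(132,Iraq,0.573,69,5,9,3177,Iraq,)",
    "(89,Oman,0.705,73,5,11,22841,Oman,)",
]


def parse_dump_line(line):
    """Split one dumped tuple into fields, mapping empty fields to None."""
    inner = line.strip()[1:-1]  # drop the surrounding parentheses
    return [field if field != "" else None for field in inner.split(",")]


def countries_missing_exports(lines):
    """Return the country names whose last (export) field is empty."""
    parsed = (parse_dump_line(line) for line in lines)
    return [fields[1] for fields in parsed if fields[-1] is None]


if __name__ == "__main__":
    print(countries_missing_exports(DUMP_LINES))
```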
When we run the Pig script, Pig converts it into MapReduce jobs and executes them. As described by the Pig Latin script, Pig loads the data from the CSV files, runs the transformation commands, and finally joins the two datasets.
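To give a feel for how a join can be expressed as a MapReduce job, the following Python sketch walks through a reduce-side join: the map step tags each record with its source and emits it under the join key, the shuffle step groups records by key, and the reduce step pairs up records from the two sources. This is a conceptual illustration only. The plan that Pig actually generates may differ, and all function names here are hypothetical. The rows are copied from the sample output in this recipe.

```python
from collections import defaultdict

# Rows copied from the sample output in this recipe. Cuba appears only on
# the export side here, so the inner join drops it.
HDI_ROWS = [
    (80, "Peru", 0.725, 74, 8, 12, 8389),
    (44, "Chile", 0.805, 79, 9, 14, 13329),
    (101, "China", 0.687, 73, 7, 11, 7476),
]
EXPORT_ROWS = [
    ("Peru", 25.108027),
    ("Chile", 38.71985),
    ("China", 29.571701),
    ("Cuba", 19.613546),
]


def map_phase(tagged_inputs):
    # Emit (join key, (source tag, record)) for every input record.
    for tag, records, key_index in tagged_inputs:
        for record in records:
            yield record[key_index], (tag, record)


def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # For each key, combine every HDI record with every export record.
    for key in sorted(groups):
        hdi = [rec for tag, rec in groups[key] if tag == "hdi"]
        exports = [rec for tag, rec in groups[key] if tag == "export"]
        for hdi_rec in hdi:
            for export_rec in exports:
                yield hdi_rec + export_rec


def reduce_side_join(hdi_rows, export_rows):
    pairs = map_phase([("hdi", hdi_rows, 1), ("export", export_rows, 0)])
    return list(reduce_phase(shuffle(pairs)))


if __name__ == "__main__":
    for row in reduce_side_join(HDI_ROWS, EXPORT_ROWS):
        print(row)
```

Because the records are grouped by the join key, the joined tuples come out in key order in this sketch rather than in the order of either input.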
Pig supports many other operations and built-in functions. The operations are documented at http://pig.apache.org/docs/r0.10.0/basic.html and the built-in functions at http://pig.apache.org/docs/r0.10.0/func.html.