Set operations (join, union) and sorting with Pig

This recipe explains how to carry out join and sort operations with Pig.

This sample will use two datasets. The first dataset has the Gross National Income (GNI) per capita by country, and the second dataset has the exports of the country as a percentage of its gross domestic product.

This recipe will use Pig to process the datasets: it creates a list of countries that have a gross national income per capita of more than $2000, sorts that list by the GNI value, and then joins it with the export dataset.

Getting ready

This recipe needs a working Pig installation. If you have not installed Pig yet, follow the earlier recipe to install it.

How to do it...

This section will describe how to use Pig to join two datasets.

  1. Change the directory to PIG_HOME.
  2. Copy resources/chapter5/hdi-data.csv and resources/chapter5/export-data.csv to PIG_HOME/bin.
  3. Copy the resources/chapter5/countryJoin.pig script to PIG_HOME/bin.
  4. Open the script countryJoin.pig with your favorite editor. The script joins the HDI data and the export data together. Pig scripts are written in a language called Pig Latin.
    A = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;
    C = ORDER B BY gni;
    D = LOAD 'export-data.csv' USING PigStorage(',') AS (country:chararray, expct:float);
    E = JOIN C BY country, D BY country;
    DUMP E;

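    The first and fourth lines load the data from the CSV files. As described in the earlier recipe, PigStorage(',') asks Pig to use ',' as the separator and assigns the values to the fields listed in the AS clause.

    The second line keeps only the records whose gni value is greater than 2000, and the third line sorts those records by gni (in ascending order, which is the default).

    Then the fifth line joins the two datasets together on the country field, and the last line prints the result.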

  5. Run the Pig Latin script by running the following command from the PIG_HOME directory:
    > bin/pig -x local bin/countryJoin.pig
    (51,Cuba,0.776,79,9,17,5416,Cuba,19.613546)
    (100,Fiji,0.688,69,10,13,4145,Fiji,52.537148)
    (132,Iraq,0.573,69,5,9,3177,Iraq,)
    (89,Oman,0.705,73,5,11,22841,Oman,)
    (80,Peru,0.725,74,8,12,8389,Peru,25.108027)
    (44,Chile,0.805,79,9,14,13329,Chile,38.71985)
    (101,China,0.687,73,7,11,7476,China,29.571701)
    (106,Gabon,0.674,62,7,13,12249,Gabon,61.610462)
    (134,India,0.547,65,4,10,3468,India,21.537624)
    ...
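
Note that the sample output above is not sorted by GNI. Pig does not guarantee that the order produced by ORDER survives further processing such as a JOIN. If the final result needs to be sorted, one option is to sort after the join instead of before it. The following is a minimal sketch of that variation, not the script shipped with the recipe; the relation name F and the choice of descending order are assumptions made for illustration.

```pig
-- Sketch: same inputs as countryJoin.pig, but the sort is applied after the join.
A = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
D = LOAD 'export-data.csv' USING PigStorage(',') AS (country:chararray, expct:float);
E = JOIN B BY country, D BY country;
-- After a join, fields are qualified with the relation they came from (B::gni).
-- F is a hypothetical relation name; DESC puts the highest GNI first.
F = ORDER E BY B::gni DESC;
DUMP F;
```

Because DUMP is applied directly to the ordered relation F, the printed records come out in sorted order.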
    

How it works...

When we run the Pig script, Pig converts it to MapReduce jobs and executes them. As described by the Pig Latin script, Pig loads the data from the CSV files, applies the filter and sort transformations, and finally joins the two datasets.
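
To see what Pig plans to execute, the DESCRIBE and EXPLAIN operators can be added to the script in place of DUMP. The following is a sketch that assumes the same input files and schemas as countryJoin.pig; it only inspects the join and does not print the joined records.

```pig
-- Sketch: inspect the schema and the execution plans of the joined relation.
A = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
D = LOAD 'export-data.csv' USING PigStorage(',') AS (country:chararray, expct:float);
E = JOIN C BY country, D BY country;
-- DESCRIBE prints the schema of E, including the C:: and D:: field prefixes.
DESCRIBE E;
-- EXPLAIN prints the logical, physical, and MapReduce plans Pig builds for E.
EXPLAIN E;
```

The MapReduce plan section of the EXPLAIN output shows how the statements are grouped into jobs.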

There's more...

Pig supports many other operations and built-in functions. You can find details about the operations at http://pig.apache.org/docs/r0.10.0/basic.html and details about the built-in functions at http://pig.apache.org/docs/r0.10.0/func.html.
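
One of those operations is UNION, which merges the contents of two relations. The following is a small sketch using the HDI dataset from this recipe. The relation names and the 10000 threshold are hypothetical and chosen only for illustration.

```pig
-- Sketch: build two subsets of the HDI data and merge them with UNION.
A = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
-- Hypothetical income bands; the 10000 cut-off is an assumption.
mid_income = FILTER A BY gni > 2000 AND gni <= 10000;
high_income = FILTER A BY gni > 10000;
-- UNION concatenates the two relations. It does not remove duplicates
-- and does not preserve any ordering of the inputs.
combined = UNION mid_income, high_income;
DUMP combined;
```

If the inputs may overlap, applying DISTINCT to the result removes duplicate records.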
