Running your first Pig command

This recipe runs a basic Pig script. As the sample dataset, we will use the Human Development Report (HDR) data, which lists the Gross National Income (GNI) per capita by country. The dataset can be found at http://hdr.undp.org/en/statistics/data/. This recipe uses Pig to process the dataset and create a list of countries that have a GNI per capita of more than $2000, sorted by the GNI value.

How to do it...

This section describes how to use a Pig Latin script to find the countries in the HDR dataset that have a GNI per capita above $2000, sorted by GNI.

  1. From the sample code, copy the dataset resources/chapter5/hdi-data.csv to the PIG_HOME/bin directory.
  2. From the sample code, copy the Pig script resources/chapter5/countryFilter.pig to the PIG_HOME/bin directory.
  3. Open the Pig script in your favorite editor. It will look like the following:
    A = LOAD 'hdi-data.csv' USING PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;
    C = ORDER B BY gni;
    DUMP C;

    The first line instructs Pig to load the CSV (comma-separated values) file into the variable A. The PigStorage(',') portion tells Pig to split each line using ',' as the separator and to assign the resulting values to the fields described in the AS clause.

    After loading the data, you can process it using Pig commands. Each Pig command manipulates the data, and together the commands form a data-processing pipeline. Because each step processes the output of an earlier step, and the only dependencies between steps are data dependencies, Pig is called a dataflow language.

    Finally, the dump command prints the results to the screen.

  4. Run the Pig script by running the following command from the PIG_HOME directory:
    >bin/pig -x local bin/countryFilter.pig
    

    When executed, the script prints the following results. As specified in the script, it prints the records of countries that have a GNI value greater than $2000, sorted by GNI.

    (126,Kyrgyzstan,0.615,67,9,12,2036)
    (156,Nigeria,0.459,51,5,8,2069)
    (154,Yemen,0.462,65,2,8,2213)
    (138,Lao People's Democratic Republic,0.524,67,4,9,2242)
    (153,Papua New Guinea,0.466,62,4,5,2271)
    (165,Djibouti,0.43,57,3,5,2335)
    (129,Nicaragua,0.589,74,5,10,2430)
    (145,Pakistan,0.504,65,4,6,2550)
    (114,Occupied Palestinian Territory,0.641,72,8,12,2656)
    (128,Viet Nam,0.593,75,5,10,2805)
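
To make the dataflow in step 3 more concrete, here is a rough plain-Python analogue of the script's load, filter, order, and dump steps. It is only an illustration, not Pig and not how Pig executes the script. The names country_filter, load, FIELDS, TYPES, and SAMPLE_CSV are made up for this sketch. Three of the sample rows are copied from the output above, and the "Exampleland" row is invented purely so that the filter has something to drop.

```python
# Three rows copied from the recipe's output (deliberately out of order) plus
# one made-up row, "Exampleland", which is NOT in the HDR dataset.
SAMPLE_CSV = """\
128,Viet Nam,0.593,75,5,10,2805
0,Exampleland,0.0,0,0,0,1500
126,Kyrgyzstan,0.615,67,9,12,2036
156,Nigeria,0.459,51,5,8,2069
"""

# Field names and types mirror the AS clause of the Pig script.
FIELDS = ("id", "country", "hdi", "lifeex", "mysch", "eysch", "gni")
TYPES = (int, str, float, int, int, int, int)


def load(text):
    """Rough analogue of LOAD ... USING PigStorage(',') AS (...)."""
    for line in text.splitlines():
        if not line:
            continue
        parts = line.split(",")
        yield {name: cast(value) for name, cast, value in zip(FIELDS, TYPES, parts)}


def country_filter(text, threshold=2000):
    a = load(text)                                # A = LOAD ...
    b = (r for r in a if r["gni"] > threshold)    # B = FILTER A BY gni > 2000;
    c = sorted(b, key=lambda r: r["gni"])         # C = ORDER B BY gni;
    return [tuple(r[name] for name in FIELDS) for r in c]


if __name__ == "__main__":
    for record in country_filter(SAMPLE_CSV):     # DUMP C;
        print(record)
```

Each variable only depends on the one before it, which is the sense in which the script is a dataflow: the made-up row is dropped by the filter step, and the remaining rows come out in ascending GNI order.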
    
    

How it works...

When we run the Pig script, Pig internally compiles the Pig commands into an optimized set of MapReduce jobs and runs them in a MapReduce cluster. Chaining MapReduce jobs directly through the MapReduce interface is cumbersome, because users have to write code to pass the output of one job to the next and to detect failures. Pig reduces each such step to a single-line command and handles the details internally. For complex jobs, the resulting Pig script is easier to write and manage than the MapReduce code that does the same thing.
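
The following toy sketch hints at the kind of bookkeeping that hand-chained jobs require. It uses plain Python functions and local files in place of real MapReduce jobs and does not call any Hadoop or Pig API. The names filter_job, sort_job, and run_chain are hypothetical, and the sketch assumes that the GNI value is the last column of every row.

```python
import os
import tempfile


def filter_job(in_path, out_path, threshold=2000):
    """Stage 1: keep rows whose last column (GNI) exceeds the threshold."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if int(line.rstrip("\n").split(",")[-1]) > threshold:
                dst.write(line)


def sort_job(in_path, out_path):
    """Stage 2: order rows by the last column (GNI), ascending."""
    with open(in_path) as src:
        rows = sorted(src, key=lambda line: int(line.rstrip("\n").split(",")[-1]))
    with open(out_path, "w") as dst:
        dst.writelines(rows)


def run_chain(in_path, work_dir):
    """Hand-written driver: feed each stage's output to the next stage and
    stop at the first failure. Returns the path of the final output."""
    stages = [("filter", filter_job), ("sort", sort_job)]
    current = in_path
    for name, job in stages:
        out_path = os.path.join(work_dir, name + ".out")
        try:
            job(current, out_path)
        except (OSError, ValueError) as exc:
            raise RuntimeError("stage %r failed: %s" % (name, exc)) from exc
        current = out_path  # this stage's output is the next stage's input
    return current


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    in_path = os.path.join(work_dir, "in.csv")
    with open(in_path, "w") as f:
        # Two rows from the recipe's output; every row ends with a newline.
        f.write("128,Viet Nam,0.593,75,5,10,2805\n")
        f.write("126,Kyrgyzstan,0.615,67,9,12,2036\n")
    with open(run_chain(in_path, work_dir)) as f:
        print(f.read(), end="")
```

In the Pig script, the equivalent wiring is implied by the variable names alone: B reads from A, and C reads from B.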
