This recipe uses Pig to group the IP addresses contained in the ip_to_country dataset and count the number of IP addresses listed for each country.
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Pig 0.9.2 installed on your client machine and on the environment path for the active user account. This recipe depends on the ip_to_country dataset included with the book being loaded into HDFS at the absolute path /input/weblog_ip/ip_to_country.txt.
Carry out the following steps to perform a SELECT and GROUP BY operation in Pig:
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip: chararray, country: chararray);
country_grpd = GROUP ip_countries BY country;
country_counts = FOREACH country_grpd GENERATE FLATTEN(group), COUNT(ip_countries) AS counts;
STORE country_counts INTO '/output/geo_weblog_entries';
Save the script as group_by_country.pig and run it using Pig's -f option.
The first line creates a Pig relation named ip_countries from the tab-delimited records stored in HDFS. The relation specifies two attributes, namely ip and country, both character arrays. The second line creates the country_grpd relation containing a record for each distinct country in the ip_countries relation. The third line tells Pig to iterate over the country_grpd relation and count the number of records in the ip_countries relation that map to the current country. The results of this iteration are assigned to a new relation named country_counts, which consists of tuples containing exactly two attributes, namely group and counts. The fourth line stores the tuples contained in this relation to the output directory specified by /output/geo_weblog_entries.
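To make the intermediate structure concrete, the grouped relation can be inspected from the Grunt shell before anything is stored. The following is a minimal sketch, assuming the same input path as the recipe; the first_counts alias is hypothetical, and the exact text that DESCRIBE prints may differ between Pig versions.

```pig
-- Load the tab-delimited dataset (PigStorage, the default loader, splits on tabs).
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip: chararray, country: chararray);
country_grpd = GROUP ip_countries BY country;

-- Print the schema: a 'group' field holding the country, plus a bag named
-- 'ip_countries' holding every original (ip, country) tuple for that country.
DESCRIBE country_grpd;

-- COUNT is applied to that bag, once per group.
country_counts = FOREACH country_grpd GENERATE FLATTEN(group), COUNT(ip_countries) AS counts;

-- Look at a handful of results without writing anything to HDFS.
-- (first_counts is a hypothetical alias used only for this check.)
first_counts = LIMIT country_counts 5;
DUMP first_counts;
```

DUMP triggers a job just as STORE does, so on a large dataset the LIMIT only reduces what is printed, not necessarily the work done.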
The output is not sorted by country in either ascending or descending order.
You should see in HDFS, under /output/geo_weblog_entries, one or more part files containing tab-delimited country listings and their IP address counts.
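If sorted output is needed, an ORDER statement can be added before the STORE. This is a sketch and not part of the recipe itself: the country alias, the country_sorted relation, and the /output/geo_weblog_entries_sorted path are assumptions chosen for illustration.

```pig
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip: chararray, country: chararray);
country_grpd = GROUP ip_countries BY country;

-- Name the grouping key explicitly so it can be referenced in ORDER.
country_counts = FOREACH country_grpd GENERATE group AS country, COUNT(ip_countries) AS counts;

-- Sort alphabetically by country; use 'BY counts DESC' to rank by count instead.
country_sorted = ORDER country_counts BY country ASC;

-- Hypothetical output path; Pig fails if the directory already exists.
STORE country_sorted INTO '/output/geo_weblog_entries_sorted';
```

Writing to a separate directory keeps the recipe's original unsorted output intact for comparison.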