We used the Oozie coordinator in Chapter 1, Meet Hunk, to import massive amounts of data. The data is partitioned by date and stored in a binary format with a schema, which makes it a production-ready approach; Avro is well supported across the whole Hadoop ecosystem. Now we are going to build a custom application on top of that data. Let's start with a description of the data.
Here is a description of the data stored in the base table:
The following screenshot is from the site hosting the dataset:
The idea of this dataset is to divide the city into equal square areas and map typed subscriber activity onto these regions. The assumption is that such a mapping can provide insights into the relationships between the hour of the day, the type of activity, and the area of the city.
There are two datasets. The first contains customer activity records (CDRs). The second is a dictionary: it holds the exact coordinates of each activity square shown in the earlier screenshot.
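To illustrate the square-grid idea, here is a minimal sketch that maps a square ID to a row and column. The grid width and the ID numbering scheme (1-based IDs assigned left to right, row by row) are assumptions for illustration; the real layout is defined by the dataset's documentation.

```python
GRID_WIDTH = 100  # assumed width of the city grid; illustrative only

def square_to_cell(square_id: int, width: int = GRID_WIDTH) -> tuple[int, int]:
    """Map a 1-based square ID to (row, col), assuming IDs run
    left to right, row by row, starting from square 1."""
    idx = square_id - 1
    return idx // width, idx % width

print(square_to_cell(1))    # (0, 0)
print(square_to_cell(101))  # (1, 0) -- first square of the second row
```

The dictionary dataset below makes this mapping unnecessary in practice, since it stores real coordinates for each square; the sketch only shows why a single square ID is enough to address a cell of the grid.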
Refer to the section Setting up virtual index for data stored in Hadoop in Chapter 2, Explore Hadoop Data with Hunk, where the virtual index for the Milano CDR data is created. Here is the CDR layout in HDFS:
[cloudera@quickstart ~]$ hadoop fs -ls -R /masterdata/stream/milano_cdr
drwxr-xr-x   - cloudera supergroup          0 2015-08-03 13:55 /masterdata/stream/milano_cdr/2013
drwxr-xr-x   - cloudera supergroup          0 2015-03-25 02:13 ...
(some output deleted to reduce its size)
-rw-r--r--   1 cloudera supergroup   69259863 2015-03-25 02:13 /masterdata/stream/milano_cdr/2013/12/07/part-m-00000.avro
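The layout is a plain year/month/day partitioning, so the path for any given day can be derived mechanically. A small sketch, assuming the directory structure inferred from the listing above:

```python
from datetime import date

BASE = "/masterdata/stream/milano_cdr"  # root directory seen in the hadoop fs listing

def partition_path(day: date) -> str:
    """Return the HDFS directory holding one day of CDR data,
    following the year/month/day layout shown in the listing."""
    return f"{BASE}/{day.year:04d}/{day.month:02d}/{day.day:02d}"

print(partition_path(date(2013, 12, 1)))  # /masterdata/stream/milano_cdr/2013/12/01
```

This is the same path we will point the virtual index at in the next step.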
So you have relatively compact, aggregated data for the first seven days of December 2013. Let's create a virtual index for December 1, 2013 with the following settings:
| Property name | Value |
| --- | --- |
| Name | |
| Path to data in HDFS | |
| Provider | Choose Hadoop hunk provider from the drop-down list |
Explore it and check that you see this sample data:
There is a file that provides longitude and latitude values for the squares. We need to create a virtual index on top of this dictionary so that it can later be joined with the aggregated data; the actual coordinates are required to display the squares on Google Maps.
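A minimal sketch of the join we will later express as a Hunk search: the dictionary maps each square ID to coordinates, and we enrich CDR rows with them. The field names (square_id, lon, lat) and the sample values are illustrative assumptions; the real fields come from the virtual indexes.

```python
# Hypothetical dictionary: square_id -> (lon, lat); values are made up.
coords = {1: (9.011, 45.359), 2: (9.013, 45.359)}

# Hypothetical aggregated CDR rows keyed by the same square_id.
cdr_rows = [
    {"square_id": 1, "sms_in": 10.5},
    {"square_id": 2, "sms_in": 3.2},
]

# Enrich each activity row with the coordinates of its square.
enriched = [
    {**row, "lon": coords[row["square_id"]][0], "lat": coords[row["square_id"]][1]}
    for row in cdr_rows
    if row["square_id"] in coords
]
print(enriched[0]["lon"])  # 9.011
```

The virtual index over the dictionary plays exactly the role of the `coords` lookup here: a small table joined to the large activity dataset by square ID.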
The virtual index settings for the so-called geojson dictionary should be:
| Property name | Value |
| --- | --- |
| Name | |
| Path to data in HDFS | |
| Provider | Choose Hadoop hunk provider from the drop-down list |
Let's try to explore some data from that virtual index:
Scroll down and verify the advanced settings for the index. The names and values should be:
| Property name | Value |
| --- | --- |
| Name | |
| DATETIME_CONFIG | |
| SHOULD_LINEMERGE | |
| NO_BINARY_CHECK | |
| disabled | |
| pulldown_type | |
Save the settings with this name: scv_with_comma_and_title.
Verify that you can see lines with longitude, latitude, and squares.
Use the search application to verify that the virtual index is set up correctly. Here is a search query that selects several fields from the index:
index="geojson" | fields square, lon1, lat1 | head 10
We would like to shorten the feedback loop while developing our application, so let's trim the source data to make our queries run faster.
Open the Pig editor at http://quickstart.cloudera:8888/pig/#scripts and open the script stored there:
You should see the script. Click the Submit button to run it and create a sample dataset:
rmf /masterdata/stream/milano_cdr_sample
REGISTER 'hdfs:///user/oozie/share/lib/lib_20141218043440/sqoop/avro-mapred-1.7.6-cdh5.3.0-hadoop2.jar';
data = LOAD '/masterdata/stream/milano_cdr/2013/12/01' USING AvroStorage();
-- keep only the records whose time_interval equals 1385884800000L
filtered = FILTER data BY time_interval == 1385884800000L;
STORE filtered INTO '/masterdata/stream/milano_cdr_sample' USING AvroStorage();
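The filter keys on a single value of the time_interval field, which holds epoch time in milliseconds. A quick sketch in plain Python shows which moment the constant in the script corresponds to:

```python
from datetime import datetime, timezone

TIME_INTERVAL_MS = 1385884800000  # the constant used in the Pig FILTER statement

# time_interval is epoch milliseconds; divide by 1000 to get seconds.
moment = datetime.fromtimestamp(TIME_INTERVAL_MS / 1000, tz=timezone.utc)
print(moment.isoformat())  # 2013-12-01T08:00:00+00:00
```

So the sample keeps only the records of a single interval on December 1, 2013, which is exactly what we want for a small development dataset.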
Have a look at the Hue UI: this is the editor for Pig scripts, and a ready-to-use script is already on the VM. The only thing you need to do is click the Submit button:
This script reads the first day of the dataset and filters it by the time_interval field, which significantly reduces the amount of data. You should get output like this within a few minutes:
Input(s):
Successfully read 4637377 records (91757807 bytes) from: "/masterdata/stream/milano_cdr/2013/12/01"
Output(s):
Successfully stored 35473 records (2093830 bytes) in: "/masterdata/stream/milano_cdr_sample"
Counters:
Total records written : 35473
Total bytes written : 2093830
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
There will be more lines in the output; just find the ones shown here. They say that Pig read 4.6 million records and stored 35 thousand. We reduced the amount of data for testing purposes, as described earlier.
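Using the counters from the job output, a quick calculation shows how aggressive the sampling is:

```python
# Record counts taken from the Pig job output above.
records_in = 4_637_377   # records read from the full day of CDR data
records_out = 35_473     # records stored in the sample dataset

ratio = records_out / records_in
print(f"the sample keeps {ratio:.2%} of the source records")  # about 0.76%
```

Well under one percent of the source data survives, which is why queries against the sample come back quickly during development.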
Now create a virtual index over the sample data; we will use it while developing our application. Here are the settings:
| Property name | Value |
| --- | --- |
| Name | |
| Path to data in HDFS | |
| Provider | Choose Hadoop hunk provider from the drop-down list |
Use the search application to check that you can correctly access the sample data:
index="milano_cdr_sample" | head 10
You should see something similar to this: