A mini project on AWS Data Lake

In this section, we will build a completely new data lake using the AWS Data Lake solution. We will first review the business use case; then we will deploy the data lake, ingest data, and process it; finally, we will analyze it using QuickSight.

Mini use case business context

For this mini use case, we are going to analyze air quality data from various states in the USA and see whether there is any relationship between population trends and air quality over time. Let's review the source datasets for this project.

Air quality index

The Environmental Protection Agency (EPA) calculates the air quality index based on the concentration of pollutants. The following are the key measures tracked to determine air quality:

  • Ozone (O3)
  • Carbon monoxide (CO)
  • Sulfur dioxide (SO2)
  • Nitrogen dioxide (NO2)
  • Inhalable particulates (PM10)
  • Fine particulates (PM2.5)
  • Lead (Pb)

To make analysis easier, the EPA tracks, for every major city in the USA, the number of days per year with good air quality. For further details about the EPA air quality report, visit this site: https://www.epa.gov/outdoor-air-quality-data/about-air-data-reports#con
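For reference, these pollutants map to the per-pollutant day-count columns in the Athena table we will create later in this chapter (note that lead (Pb) is tracked by the EPA but has no day-count column in that table):

```python
# Pollutant -> per-pollutant day-count column in the EPA AQI Athena table
# defined later in this chapter. Lead (Pb) has no day-count column there.
POLLUTANT_COLUMNS = {
    "Ozone (O3)": "O3_Days",
    "Carbon monoxide (CO)": "CO_Days",
    "Sulfur dioxide (SO2)": "SO2_Days",
    "Nitrogen dioxide (NO2)": "NO2_Days",
    "Inhalable particulates (PM10)": "PM10_Days",
    "Fine particulates (PM2.5)": "PM25_Days",
}

for pollutant, column in POLLUTANT_COLUMNS.items():
    print(f"{pollutant:32} -> {column}")
```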

Census population

The US Census Bureau Population Estimates Program produces yearly population estimates for US cities, counties, and states. This information is available on the US Census Bureau website here: http://www.census.gov/programs-surveys/popest.html

Deploying AWS Data Lake using CloudFormation

The AWS Data Lake solution is deployed using AWS CloudFormation (https://aws.amazon.com/cloudformation/).

Follow these steps to deploy a new data lake solution in your account.

Creating a new stack

Here are the steps to launch a new stack:

  1. Log in to the AWS management console and select CloudFormation under the management services as shown here:

    Figure 7.2: The CloudFormation service

  2. Next, click on Create New Stack as shown in the following screenshot:

    Figure 7.3: Create new stack

  3. Select the option to upload a template to Amazon S3. The data lake template can be found at https://s3.amazonaws.com/solutions-reference/data-lake-solution/latest/data-lake-deploy.template, and also on my GitHub account at https://github.com/rnadipalli/quicksight/blob/master/miniproject/config/data-lake-deploy.template, as shown in the following screenshot:

    Figure 7.4: Upload data lake template

  4. Next, you will see the stack details screen, where you need to enter the mandatory information explained here:
    • Stack name: You can set this as PolutionAnalysisDL for the pollution analysis data lake
    • Administrator name: This is the user ID that will be the main administrator for the data lake
    • Administrator email address: Enter a valid e-mail address associated with the administrator user
    • Administrator Access IP for Elasticsearch cluster: Enter the IP address range from where the administrator(s) will access the necessary management function, for example, 106.10.172.0/0
    • Send Anonymous Usage Data: Select No if you want to opt out from sending usage information to AWS
  5. Once done, click on the Next button to move to the next screen. See the following screenshot with these options:

    Figure 7.5: Create stack detailed options

  6. You will see the Options page, where you can specify optional tags (key-value pairs) for your stack. In this case, we will skip and click on Next to continue.
  7. Now you will see the final Review page, where you can confirm the settings. Ensure that you check the I acknowledge... checkbox and then click on Create to start the creation process:

    Figure 7.6: Stack review

This completes the setup; AWS will now launch four stacks for the data lake solution. Creation takes about 25 minutes, after which you will see the following screen:

Figure 7.7: Stack creation confirmation

This completes the deployment of the data lake; next, we will review how to access this.
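If you later want to script this deployment with the AWS CLI (aws cloudformation create-stack), the same inputs can be captured in a parameters file. This is a sketch only: the ParameterKey names below are assumptions, so check the Parameters section of data-lake-deploy.template for the exact keys before using it:

```json
[
  { "ParameterKey": "AdministratorName",  "ParameterValue": "dladmin" },
  { "ParameterKey": "AdministratorEmail", "ParameterValue": "admin@example.com" },
  { "ParameterKey": "AccessIP",           "ParameterValue": "106.10.172.0/0" },
  { "ParameterKey": "SendAnonymousData",  "ParameterValue": "No" }
]
```

Such a file can be passed to the CLI with --parameters file://params.json.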

Access your data lake stack

Here are the steps required to access your newly created data lake:

  1. After the data lake is created, the administrator will receive an e-mail that contains the URL to get to the data lake console and a temporary password.
  2. Click on the URL and you will be prompted to change the password.
  3. Alternatively, go to the stack listing page and click on the link for PolutionAnalysisDL. This takes you to the stack details page, shown in the following screenshot, which also provides the URL for the data lake console:

    Figure 7.8: DL stack details

  4. Next, open the data lake console in a new tab and you will see the following login page:

    Figure 7.9: Data lake console login

This completes the data lake access section; next, we will review the data sources and ingest them into the data lake.

Acquiring the data for the mini project

For this mini project, I have downloaded data from some public websites and saved it on GitHub at this location: https://github.com/rnadipalli/quicksight/tree/master/miniproject/datasets.

For EPA air quality data, I have seven files, one for each year from 2010 to 2016, as shown in this screenshot:

Figure 7.10: EPA Air Quality Index (AQI) datasets

For the USA population data, I have consolidated it as a single file as shown here:

Figure 7.11: US population estimates

Hydrating the data lake

Now that we have acquired the source data, we will review the steps to hydrate the data lake with these datasets. The AWS Data Lake architecture recommends that all source data should be ingested into S3 buckets. In this section, we will discuss the steps needed to ingest data to an S3 bucket created for this project: quicksight-mini-project.

Air quality index data in S3

Here are the steps to load EPA data to S3:

  1. First download the data file from GitHub (https://github.com/rnadipalli/quicksight/tree/master/miniproject/datasets/EPAAQIReports) to your local desktop.
  2. Create a folder under the S3 bucket and name it EPAaqidata. Create seven subfolders under it, one for each year, such as year=2010.
  3. Upload the data files to the respective year folder.
  4. Finally, your folder structure should look like this:

    Figure 7.12: EPA AQI data in S3

US population data in S3

Here are the steps to load the US population estimates data into S3:

  1. First, download the data file from GitHub (https://github.com/rnadipalli/quicksight/blob/master/miniproject/datasets/us_population_data_wo_header.csv) to your local desktop.
  2. Next, create a folder under the S3 bucket and name it Populationdata.
  3. Upload the data file to the quicksight-mini-project/Populationdata folder.

This completes the hydration (data ingestion) to the AWS Data Lake; next, we will review how to catalog the datasets.

Cataloging data assets

In this section, we will review how to leverage the AWS Data Lake packaged metadata management web interface to categorize, tag, and catalog data assets.

Creating governance tags

As we onboard data to the data lake, it is important to record the business context of the data so that consumers can easily identify the data. AWS Data Lake solution makes it really easy for data stewards to specify these tags as and when datasets are registered to the data lake. For our mini project, we follow these steps to create these governance tags:

  1. First, log in to the data lake console with the administration credentials.
  2. From the navigation pane on the left, choose Settings under the Administration section.
  3. Next, click on the Governance tab and select Add Tag Governance.
  4. Enter the following tag names:
    • Retention Years to track the retention period of the dataset
    • Category to track the category of the dataset (EPA or Census)
    • Data Steward to track the username of the data steward
    • PII Indicator to track whether there is any Personally Identifiable Information (PII) in the dataset
  5. Click on Save to update the data lake governance settings.

This completes the governance settings as shown in the following screenshot:

Figure 7.13: Governance tag setup

Registering data packages

In this section, we will review how to register the two data packages for this mini project with AWS Data Lake solution.

EPA AQI data package

Let's see the steps required to create the EPA AQI data package:

  1. First, log in to the data lake console with the administration credentials.
  2. From the navigation pane on the left, choose Create a Package under the Content section.
  3. Next, enter the package name, description, and governance tags.
  4. Click on the Create package button and this creates the new package.
  5. The following screenshot shows the Create a Package screen for the EPA dataset:

    Figure 7.14: EPA AQI package registration basic tab

Next, follow these steps to add content to the newly created package:

  1. Click on the Content tab of the EPA AQI package.
  2. Create a manifest file that has a list of all S3 files for EPA AQI data, as follows:
            {
              "fileLocations": [
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2010/aqireport2010.csv" },
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2011/aqireport2011.csv" },
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2012/aqireport2012.csv" },
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2013/aqireport2013.csv" },
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2014/aqireport2014.csv" },
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2015/aqireport2015.csv" },
                { "url": "s3://quicksight-mini-project/EPAaqidata/year=2016/aqireport2016.csv" }
              ]
            }
    
  3. Upload the manifest file just created to the content. For your convenience, I have uploaded this file on GitHub at https://github.com/rnadipalli/quicksight/blob/master/miniproject/config/EPAAQIManifest.json.
  4. Next, click on Save as shown in the following screenshot:

    Figure 7.15: EPA AQI S3 manifest file upload

    After successfully linking the S3 files to the package, you will see the following screen, with a list of all seven files imported from the S3 manifest file:

    Figure 7.16: EPA AQI data content review

    You can optionally review the History tab of the package to see all activity for that package, as shown here:

    Figure 7.17: Package history

    This completes the registration of the EPA AQI package to AWS Data Lake.
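Since the manifest simply lists one URL per yearly file, it can also be generated rather than hand-written. A minimal sketch that reproduces the manifest shown above:

```python
import json

BUCKET = "quicksight-mini-project"

def build_manifest(years):
    """Build a data lake manifest listing one EPA AQI file per year."""
    return {
        "fileLocations": [
            {"url": f"s3://{BUCKET}/EPAaqidata/year={y}/aqireport{y}.csv"}
            for y in years
        ]
    }

# Write the manifest for 2010-2016 to a local file.
with open("EPAAQIManifest.json", "w") as f:
    json.dump(build_manifest(range(2010, 2017)), f, indent=2)
```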

USA population history package

Similar to the EPA AQI package, you can create the USA population history package. I will highlight the key steps for this package:

  1. From the navigation pane on the left, choose Create a Package under the Content section.
  2. Next, enter the package name, description, and governance tags as shown in the following screenshot:

    Figure 7.18: USA population create package

  3. Next, link the S3 files to the package using the following manifest file: https://github.com/rnadipalli/quicksight/blob/master/miniproject/config/PopulationHistoryManifest.json.
  4. Click on Save to complete registration.

This completes the registration of the USA population package to AWS Data Lake.

Searching the data catalog

Once the data packages are registered to the data lake, the AWS solution automatically indexes this data and provides a very easy-to-use search interface. This enables data consumers to search packages available in the data lake based on their interest. Let's review the key steps to use search:

  1. First log in to the data lake console.
  2. From the navigation pane on the left, choose Search under the Content section.
  3. Enter the search term in the provided textbox. You can use * as a wildcard character if needed.
  4. Click on the Search button to submit your query and see the results, as shown in the following screenshot:

    Figure 7.19: Search catalog

From the results list, you can click on the Edit icon to go straight to the package and edit it if needed. This completes the data package registration and search section.

Extracting packages using manifest

Once you've found some data that you like, you can add it to the cart and generate manifest files with secure access links to the desired content. Let's review the steps for this:

  1. From the search results, select the packages that you want to extract and add them to your cart, similar to a shopping experience.
  2. Next, click on the Cart icon in the left navigation menu.
  3. To check out your items, click on the My Cart icon at the top right.
  4. Finally, click on Generate Manifest to generate the manifest file.

One example of using this manifest metadata is to export data from S3 to an Amazon Redshift database with a script that leverages the manifest file. Next, let's look at how to process data in a data lake.
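As a sketch of that idea, a generated manifest can be parsed to collect the S3 URLs before handing them to a COPY command or another loader (the structure assumed here is the fileLocations format used earlier in this chapter):

```python
import json

def urls_from_manifest(manifest_text):
    """Extract the S3 URLs listed in a data lake manifest document."""
    manifest = json.loads(manifest_text)
    return [entry["url"] for entry in manifest["fileLocations"]]

# Hypothetical one-entry manifest, matching the format used in this chapter.
sample = '{"fileLocations": [{"url": "s3://quicksight-mini-project/Populationdata/us_population_data_wo_header.csv"}]}'
print(urls_from_manifest(sample))
```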

Processing data in the AWS Data Lake

In this section, we will review how to prepare the data that is now registered with the data lake and get some insights from it.

Creating Athena tables

There are several options to process the data; for this mini project, we will leverage Athena to create tables on the centralized data store S3, using the following steps:

  1. Open the AWS management console for Athena using this link: https://console.aws.amazon.com/athena/home. Alternatively, search for Athena in the AWS services search bar.
  2. Using the Query Editor, run the CREATE DATABASE statement shown in the next screenshot:
            CREATE DATABASE polutionanalysisdb; 
    
  3. Next, create a new table for the EPA AQI package in S3 with a partition clause:
            CREATE EXTERNAL TABLE IF NOT EXISTS
              polutionanalysisdb.epaaqi_raw ( 
              `City_Code` int, 
              `City_Name` string, 
              `Days_with_AQI` int, 
              `Good_Days` int, 
              `Moderate_Days` int, 
              `Unhealthy_for_Sensitive_Days` int, 
              `Unhealthy_Days` int, 
              `Very_Unhealthy_Days` int, 
              `AQI_Maximum` int, 
              `AQI_90th_Percentile` int, 
              `AQI_Median` int, 
              `CO_Days` int, 
              `NO2_Days` int, 
              `O3_Days` int, 
              `SO2_Days` int, 
              `PM25_Days` int, 
              `PM10_Days` int 
            ) 
            PARTITIONED BY (year string) 
            ROW FORMAT SERDE 
              'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
            WITH SERDEPROPERTIES ( 
              'serialization.format' = ',', 
              'field.delim' = '|' 
            ) LOCATION 's3://quicksight-mini-project/EPAaqidata/'; 
    
  4. The preceding query is on GitHub, at https://github.com/rnadipalli/quicksight/blob/master/miniproject/sqlscripts/athenaddl.sql.
  5. Next, to load all partitions of the table, run the following command:
            MSCK REPAIR TABLE polutionanalysisdb.epaaqi_raw; 
    
  6. You can now query the table and view data as shown here:

    Figure 7.20: EPA AQI report query

  7. Next, create a new table for the history of USA population using the following SQL statement, which is also present on GitHub at https://github.com/rnadipalli/quicksight/blob/master/miniproject/sqlscripts/athenaddl.sql:
            CREATE EXTERNAL TABLE IF NOT EXISTS 
              polutionanalysisdb.population_raw ( 
              SUMLEV int, 
              REGION int, 
              DIVISION int, 
              STATE string, 
              COUNTY string, 
              STNAME string, 
              CTYNAME string, 
              CENSUS2010POP int, 
              ESTIMATESBASE2010 int, 
              POPESTIMATE2010 int, 
              POPESTIMATE2011 int, 
              POPESTIMATE2012 int, 
              POPESTIMATE2013 int, 
              POPESTIMATE2014 int, 
              POPESTIMATE2015 int, 
              NPOPCHG_2010 int, 
              NPOPCHG_2011 int, 
              NPOPCHG_2012 int, 
              NPOPCHG_2013 int, 
              NPOPCHG_2014 int, 
              NPOPCHG_2015 int, 
              BIRTHS2010 int, 
              BIRTHS2011 int, 
              BIRTHS2012 int, 
              BIRTHS2013 int, 
              BIRTHS2014 int, 
              BIRTHS2015 int, 
              DEATHS2010 int, 
              DEATHS2011 int, 
              DEATHS2012 int, 
              DEATHS2013 int, 
              DEATHS2014 int, 
              DEATHS2015 int, 
              NATURALINC2010 int, 
              NATURALINC2011 int, 
              NATURALINC2012 int, 
              NATURALINC2013 int, 
              NATURALINC2014 int, 
              NATURALINC2015 int, 
              INTERNATIONALMIG2010 int, 
              INTERNATIONALMIG2011 int, 
              INTERNATIONALMIG2012 int, 
              INTERNATIONALMIG2013 int, 
              INTERNATIONALMIG2014 int, 
              INTERNATIONALMIG2015 int, 
              DOMESTICMIG2010 int, 
              DOMESTICMIG2011 int, 
              DOMESTICMIG2012 int, 
              DOMESTICMIG2013 int, 
              DOMESTICMIG2014 int, 
              DOMESTICMIG2015 int, 
              NETMIG2010 int, 
              NETMIG2011 int, 
              NETMIG2012 int, 
              NETMIG2013 int, 
              NETMIG2014 int, 
              NETMIG2015 int, 
              RESIDUAL2010 int, 
              RESIDUAL2011 int, 
              RESIDUAL2012 int, 
              RESIDUAL2013 int, 
              RESIDUAL2014 int, 
              RESIDUAL2015 int, 
              GQESTIMATESBASE2010 int, 
              GQESTIMATES2010 int, 
              GQESTIMATES2011 int, 
              GQESTIMATES2012 int, 
              GQESTIMATES2013 int, 
              GQESTIMATES2014 int, 
              GQESTIMATES2015 int, 
              RBIRTH2011 float, 
              RBIRTH2012 float, 
              RBIRTH2013 float, 
              RBIRTH2014 float, 
              RBIRTH2015 float, 
              RDEATH2011 float, 
              RDEATH2012 float, 
              RDEATH2013 float, 
              RDEATH2014 float, 
              RDEATH2015 float, 
              RNATURALINC2011 float, 
              RNATURALINC2012 float, 
              RNATURALINC2013 float, 
              RNATURALINC2014 float, 
              RNATURALINC2015 float, 
              RINTERNATIONALMIG2011 float, 
              RINTERNATIONALMIG2012 float, 
              RINTERNATIONALMIG2013 float, 
              RINTERNATIONALMIG2014 float, 
              RINTERNATIONALMIG2015 float, 
              RDOMESTICMIG2011 float, 
              RDOMESTICMIG2012 float, 
              RDOMESTICMIG2013 float, 
              RDOMESTICMIG2014 float, 
              RDOMESTICMIG2015 float, 
              RNETMIG2011 float, 
              RNETMIG2012 float, 
              RNETMIG2013 float, 
              RNETMIG2014 float, 
              RNETMIG2015 float 
            )  
            ROW FORMAT SERDE 
              'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
            WITH SERDEPROPERTIES ( 
              'serialization.format' = ',', 'field.delim' = ',' )
            LOCATION 's3://quicksight-mini-project/Populationdata/'; 
    
  8. Verify that the table is working by running a simple SELECT statement on the new table.

This completes the registration of tables in Athena; next, we will see how to analyze this data using QuickSight.
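Note that the EPA table's SerDe declares '|' as the field delimiter, so the yearly report files must actually be pipe-delimited despite the .csv extension. You can sanity-check this locally with Python's csv module; the rows below are made-up sample values in the EPA AQI column order:

```python
import csv
import io

# Two hypothetical pipe-delimited rows (city code, city name, days with AQI,
# good days). The values are invented for illustration only.
sample = "10140|Aberdeen, WA|359|259\n10180|Abilene, TX|245|200"

rows = list(csv.reader(io.StringIO(sample), delimiter="|"))
print(rows[0][1])  # the comma inside "Aberdeen, WA" survives intact
```

This is also why a pipe delimiter suits this data: city names contain commas, which would break a plain comma-delimited parse.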

Analyzing using QuickSight

As they say, the best comes last. To finally make sense of the data and build interesting reports, we will use QuickSight for this mini project.

Population analysis

In this section, we will build reports to understand the impact of population on air quality over time. First, we will analyze and report the population trend and then overlay the EPA air quality data to see if there is a relationship between the data.

Creating the population dataset

Follow these steps to create a new dataset for population analysis:

  1. From the Manage data page, create a new dataset of type Athena and set the name as population_raw. Enter the data source name as polutionanalysisdb, the same as the Athena database name.
  2. The raw dataset is a flat structure with one row per city/state and different columns for population estimates for each year. With the custom SQL option, we will transform this data into multiple rows for easier reporting.
  3. In the Data source section, select the Query option and enter the following custom SQL:
            select stname, county, ctyname, '2010' as year, 
              census2010pop as populationcount 
            from polutionanalysisdb.population_raw 
            where county != '0' 
            union 
            select stname, county, ctyname, '2011' as year, 
              popestimate2011 as populationcount 
            from polutionanalysisdb.population_raw 
            where county != '0' 
            union 
            select stname, county, ctyname, '2012' as year, 
              popestimate2012 as populationcount 
            from polutionanalysisdb.population_raw 
            where county != '0' 
            union 
            select stname, county, ctyname, '2013' as year, 
              popestimate2013 as populationcount 
            from polutionanalysisdb.population_raw 
            where county != '0' 
            union 
            select stname, county, ctyname, '2014' as year, 
              popestimate2014 as populationcount 
            from polutionanalysisdb.population_raw 
            where county != '0' 
            union 
            select stname, county, ctyname, '2015' as year, 
              popestimate2015 as populationcount 
            from polutionanalysisdb.population_raw 
            where county != '0' 
    
  4. It should now show you results like this:

    Figure 7.21: Dataset for population

  5. Click on Finish to save this new dataset.
  6. Now your dataset is ready for analysis.

Insights from population dataset

In this section, we will visualize the population data. Follow these steps to create charts:

  1. Create a new analysis using the population_raw dataset created in the previous section.
  2. Next, we will create a bar chart to see the population count by state for the year 2015. For this, we will select the Visual type as Vertical stacked bar chart and then select stname as X axis, Populationcount (sum) as Y axis, and year for Group/Color.
  3. Next, we filter the year to 2015 so that we get the latest population estimates, as shown in this screenshot:

    Figure 7.22: Population bar chart

  4. This shows that California is the most populous state, followed by Texas.
  5. Next, let's see the trend of population over time using a line chart, as shown here:

    Figure 7.23: Population trend

This completes our insights on population. Next, let's see if the population has any impact on the air quality index.

Combining population and EPA datasets

To complete our initial quest, we need to combine the population data and the EPA AQI data using the following steps:

  1. From the Manage data page, create a new dataset of type Athena and set the name as epaaqi_raw. Enter the data source name as polutionanalysisdb2.
  2. For this analysis, we will select the city of Austin, Texas, and see how the population increase has impacted the air quality.
  3. For this, we will join the two datasets using the following SQL statement, which is also available on GitHub at https://github.com/rnadipalli/quicksight/blob/master/miniproject/sqlscripts/QuickSight-custom-queries.sql:
            select polr.year, popr.ctyname, popr.census2010pop as 
              PopulationCount , 
              polr.Good_Days, polr.Moderate_Days, 
                polr.Unhealthy_for_Sensitive_Days, polr.Unhealthy_Days, 
                  polr.Very_Unhealthy_Days 
            from polutionanalysisdb.population_raw popr, 
              polutionanalysisdb.epaaqi_raw polr 
            where popr.stname = 'Texas' 
            and   popr.ctyname like 'Austin%' 
            and   polr.city_name LIKE 'Austin%' 
            and   polr.year = '2010' 
            UNION 
            select polr.year, popr.ctyname, popr.POPESTIMATE2011 as 
              PopulationCount , 
              polr.Good_Days, polr.Moderate_Days, 
                polr.Unhealthy_for_Sensitive_Days, polr.Unhealthy_Days, 
                  polr.Very_Unhealthy_Days 
            from polutionanalysisdb.population_raw popr, 
              polutionanalysisdb.epaaqi_raw polr 
            where popr.stname = 'Texas' 
            and   popr.ctyname like 'Austin%' 
            and   polr.city_name LIKE 'Austin%' 
            and   polr.year = '2011' 
            UNION 
            select polr.year, popr.ctyname, popr.POPESTIMATE2012 as 
              PopulationCount , 
              polr.Good_Days, polr.Moderate_Days, 
                polr.Unhealthy_for_Sensitive_Days, polr.Unhealthy_Days, 
                  polr.Very_Unhealthy_Days 
            from polutionanalysisdb.population_raw popr, 
              polutionanalysisdb.epaaqi_raw polr 
            where popr.stname = 'Texas' 
            and   popr.ctyname like 'Austin%' 
            and   polr.city_name LIKE 'Austin%' 
            and   polr.year = '2012' 
            UNION 
            select polr.year, popr.ctyname, popr.POPESTIMATE2013 as 
              PopulationCount , 
              polr.Good_Days, polr.Moderate_Days, 
                polr.Unhealthy_for_Sensitive_Days, polr.Unhealthy_Days, 
                  polr.Very_Unhealthy_Days 
            from polutionanalysisdb.population_raw popr, 
              polutionanalysisdb.epaaqi_raw polr 
            where popr.stname = 'Texas' 
            and   popr.ctyname like 'Austin%' 
            and   polr.city_name LIKE 'Austin%' 
            and   polr.year = '2013' 
            UNION 
            select polr.year, popr.ctyname, popr.POPESTIMATE2014 as 
              PopulationCount , 
              polr.Good_Days, polr.Moderate_Days, 
                polr.Unhealthy_for_Sensitive_Days, polr.Unhealthy_Days, 
                  polr.Very_Unhealthy_Days 
            from polutionanalysisdb.population_raw popr, 
              polutionanalysisdb.epaaqi_raw polr 
            where popr.stname = 'Texas' 
            and   popr.ctyname like 'Austin%' 
            and   polr.city_name LIKE 'Austin%' 
            and   polr.year = '2014' 
            UNION 
            select polr.year, popr.ctyname, popr.POPESTIMATE2015 as 
              PopulationCount , 
              polr.Good_Days, polr.Moderate_Days, 
                polr.Unhealthy_for_Sensitive_Days, polr.Unhealthy_Days, 
                  polr.Very_Unhealthy_Days 
            from polutionanalysisdb.population_raw popr, 
              polutionanalysisdb.epaaqi_raw polr 
            where popr.stname = 'Texas' 
            and   popr.ctyname like 'Austin%' 
            and   polr.city_name LIKE 'Austin%' 
            and   polr.year = '2015' 
            order by 1 
    
  4. Next, click on Finish to save this dataset and then click on Save & visualize.

We will now see how to visualize this data.

EPA Trend with population impact

In this section, we will visualize data from the newly combined dataset. The first chart that we will create is a simple bar chart with the population trend for Austin over six years, as shown here:

Figure 7.24: EPA and population chart 1

Next, we will create a line chart to show the air quality over time for the same years:

Figure 7.25: EPA and population chart 2

We can see that the increase in population from 2010 to 2015 does seem to have an impact on the air quality index in the city of Austin, Texas. The number of good days decreased from 268 in 2010 to 239 in 2015; over the same period, the population increased from about 28,400 to 29,500.

This completes our end-to-end AWS Data Lake solution. We built a new AWS Data Lake, hydrated it with source data in S3, cataloged the metadata, built Athena tables, and analyzed the data using QuickSight. You can leverage this architecture for other real-life projects at your organization.
