In this section, we will look into the data flow path from the US census to QuickSight using a Redshift data store. The source data is obtained from the public site and provides information about household income and population in the USA by zip code. The path to get to QuickSight involves the following steps:
The following are the pre-requisites to load data from Redshift to QuickSight:
SELECT
permission on some system tables so that QuickSight can estimate the table size. These tables are pg_stats
, pg_class
, and pg_namespace
.For this demonstration, we will use the file USACensusSalarybyZip.csv
which has median and mean household salary by zip code. This will help us understand if the college tuition fees are affordable by the local residents. The source for this data is from the University of Michigan, Population Studies Center website, which gets this from the Census Bureau (http://www.psc.isr.umich.edu/dis/census/Features/tract2zip/).
Figure 2.19: University of Michigan Census site
The preceding screenshot shows the University of Michigan Population Studies Center web page with the download link highlighted.
The sample files are also in GitHub (edited and removed header line and converted text to number) https://github.com/rnadipalli/QuickSight/tree/master/sampledata.
Here are the detailed steps to upload a file to an S3 filesystem:
collegescorecard
bucket that I created earlier.This completes the upload of the USACensusSalarybyZip.csv
file to S3.
Next, to load the data from S3 to a Redshift instance, we will use the AWS Data Pipeline service. From the Services list, click on Data Pipeline and then click on Create new pipeline.
The next screenshot shows the Create Pipeline form, enter the details as follows:
zip varchar(20)
, medianincome int
, meanincome int
, populationcount int
.
Figure 2.20: AWS pipeline for S3 to Redshift
For reference, you can import the pipeline I have saved in GitHub at this location: https://github.com/rnadipalli/QuickSight/blob/master/awsdatapipelines/LoadS3SalarytoRedshiftPipeline.json.
Now that data is in the Redshift instance, follow these steps to load it to SPICE:
Figure 2.21 Load Redshift to SPICE
Figure 2.22 Redshift table selection
Figure 2.23: Redshift prepare data
Figure 2.24: Dataset preview
21771
has a population of 26,023
and average median salary of $105,144.
Figure 2.25: Redshift data visualization