Loading Redshift data to QuickSight

In this section, we will walk through the data flow path from US Census data to QuickSight using a Redshift data store. The source data is obtained from a public website and provides information about household income and population in the USA by zip code. The path to QuickSight involves the following steps:

  1. Uploading data to S3.
  2. Creating an AWS Data Pipeline to load data from S3 to Redshift.
  3. Loading data from Redshift to QuickSight.

Prerequisites

The following are the prerequisites to load data from Redshift to QuickSight:

  • You must have a Redshift instance created.
  • The data file must be a CSV and must not contain a header row.
  • You must have a database username and password that QuickSight can use to connect to the database, with the SELECT permission on the system tables pg_stats, pg_class, and pg_namespace so that QuickSight can estimate table sizes (see the sketch after this list).
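
The following is a minimal sketch, assuming psycopg2 and placeholder connection details, of granting such a user the permissions described above. The user name quicksight_user and the public schema are hypothetical, and the exact grant statements may vary with your cluster setup; your administrator may prefer to run the equivalent SQL directly in a client.

    # Minimal sketch: grant a (hypothetical) QuickSight database user SELECT on the
    # data schema and on the system tables QuickSight uses to estimate table sizes.
    # All connection details and names are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439,
        dbname="dev",
        user="masteruser",   # an administrative user
        password="********",
    )
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO quicksight_user;")
    for system_table in ("pg_stats", "pg_class", "pg_namespace"):
        cur.execute(f"GRANT SELECT ON {system_table} TO quicksight_user;")

    cur.close()
    conn.close()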

Uploading data to S3

For this demonstration, we will use the file USACensusSalarybyZip.csv, which contains median and mean household salary by zip code. This will help us understand whether college tuition fees are affordable for local residents. This data is sourced from the University of Michigan Population Studies Center website, which in turn obtains it from the Census Bureau (http://www.psc.isr.umich.edu/dis/census/Features/tract2zip/).

Figure 2.19: University of Michigan Census site

The preceding screenshot shows the University of Michigan Population Studies Center web page with the download link highlighted.

Note

The sample files are also available on GitHub (edited to remove the header line and convert text to numbers): https://github.com/rnadipalli/QuickSight/tree/master/sampledata.

Here are the detailed steps to upload the file to an S3 bucket:

  1. Download the file to your local system (laptop).
  2. Next, to upload the file to AWS S3, log in to your account and, from the Services menu, select S3.
  3. Next, select an S3 bucket or create a new one. For this file, I have selected the collegescorecard bucket that I created earlier.
  4. Next, click on the Upload button, select the local file from your system, and then upload it.

This completes the upload of the USACensusSalarybyZip.csv file to S3.
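
If you prefer to script the upload instead of using the console, the following is a minimal sketch using boto3. The bucket name matches the one used in the console steps above; your own bucket and local file path may differ, and boto3 must be configured with credentials that can write to the bucket.

    import boto3

    s3 = boto3.client("s3")

    # Upload the census CSV downloaded in step 1 to the bucket selected in step 3
    s3.upload_file(
        Filename="USACensusSalarybyZip.csv",   # local path to the downloaded file
        Bucket="collegescorecard",             # bucket created earlier in the chapter
        Key="USACensusSalarybyZip.csv",
    )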

Creating and executing an AWS Data Pipeline

Next, to load the data from S3 to a Redshift instance, we will use the AWS Data Pipeline service. From the Services list, click on Data Pipeline and then click on Create new pipeline.

The next screenshot shows the Create Pipeline form; enter the details as follows:

  • Name and Description: Provide an appropriate name and description.
  • Source: From the drop-down menu, select Load data from S3 into Redshift.
  • Parameters: Provide the appropriate Redshift credentials, S3 location, insert query, table name, and the create table query. The table has the following structure: zip varchar(20), medianincome int, meanincome int, populationcount int (an equivalent SQL sketch follows this list).

    Figure 2.20: AWS pipeline for S3 to Redshift

  • Next, provide schedule information or click on Activate now to kick off the pipeline immediately.

Note

For reference, you can import the pipeline I have saved in GitHub at this location: https://github.com/rnadipalli/QuickSight/blob/master/awsdatapipelines/LoadS3SalarytoRedshiftPipeline.json.

  • AWS will now execute this pipeline on the EC2 instance that you specified during configuration, and the data is published to the Redshift database.
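
For reference, the following is a minimal sketch, using psycopg2 with placeholder connection details, of the SQL that the S3-to-Redshift template effectively runs: the create table query supplied as a pipeline parameter, followed by a COPY of the headerless CSV from S3. The table name salarybyzip and the IAM role ARN are hypothetical; the pipeline itself generates and runs the equivalent statements for you.

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439,
        dbname="dev",
        user="masteruser",
        password="********",
    )
    cur = conn.cursor()

    # Table structure described in the Parameters bullet above
    cur.execute("""
        CREATE TABLE IF NOT EXISTS salarybyzip (
            zip             VARCHAR(20),
            medianincome    INT,
            meanincome      INT,
            populationcount INT
        );
    """)

    # Load the headerless CSV uploaded to S3 earlier
    cur.execute("""
        COPY salarybyzip
        FROM 's3://collegescorecard/USACensusSalarybyZip.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole'
        CSV;
    """)

    conn.commit()
    cur.close()
    conn.close()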

Creating a new QuickSight dataset from Redshift

Now that data is in the Redshift instance, follow these steps to load it to SPICE:

  1. From the QuickSight home page, click on Manage data.
  2. Select New data set and then select the Redshift option.
  3. Enter the details of the Redshift data source, as shown in the following screenshot, and then click on Connect (a boto3-based equivalent of steps 2 to 4 is sketched after this list):

    Figure 2.21: Load Redshift to SPICE

  4. Next, select the schema and table from the drop-down menu, as shown in the following screenshot:

    Figure 2.22: Redshift table selection

  5. Select Prepare data to edit the data format, as shown in the following screenshot:

    Figure 2.23: Redshift prepare data

  6. In the Prepare screen, you can see all the fields and data; next, click on Prepare data & visualize to analyze this data in QuickSight.

    Figure 2.24: Dataset preview

  7. Now, in the QuickSight visualization, you can change the visualization type, dimensions, and measures, and see an interesting scatterplot that shows which zip codes have the highest average median income and the corresponding population. With our dataset, zip code 21771 has a population of 26,023 and an average median salary of $105,144.

    Figure 2.25: Redshift data visualization
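
As an aside, console steps 2 to 4 can also be expressed programmatically with the boto3 quicksight client, assuming the QuickSight API is available in your account and region. This is only a sketch: the account ID, data source and dataset IDs, connection details, and the salarybyzip table name are all placeholders.

    import boto3

    qs = boto3.client("quicksight", region_name="us-east-1")
    account_id = "123456789012"  # placeholder AWS account ID

    # Register the Redshift cluster as a QuickSight data source (step 3)
    data_source = qs.create_data_source(
        AwsAccountId=account_id,
        DataSourceId="redshift-census",
        Name="Redshift census data",
        Type="REDSHIFT",
        DataSourceParameters={
            "RedshiftParameters": {
                "Host": "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                "Port": 5439,
                "Database": "dev",
            }
        },
        Credentials={
            "CredentialPair": {"Username": "quicksight_user", "Password": "********"}
        },
    )

    # Create a SPICE dataset from the table loaded by the pipeline (step 4)
    qs.create_data_set(
        AwsAccountId=account_id,
        DataSetId="salary-by-zip",
        Name="USA census salary by zip",
        ImportMode="SPICE",
        PhysicalTableMap={
            "salarybyzip": {
                "RelationalTable": {
                    "DataSourceArn": data_source["Arn"],
                    "Schema": "public",               # placeholder schema
                    "Name": "salarybyzip",            # hypothetical table name
                    "InputColumns": [
                        {"Name": "zip", "Type": "STRING"},
                        {"Name": "medianincome", "Type": "INTEGER"},
                        {"Name": "meanincome", "Type": "INTEGER"},
                        {"Name": "populationcount", "Type": "INTEGER"},
                    ],
                }
            }
        },
    )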
