Joining data in SPICE

Joining tables is a very common requirement for data analysis and SPICE makes this effort a breeze. To understand this feature, let's go for a real use case. In the Chapter 2, Exploring Any Data, Loading data from Redshift to Quicksight section, we had uploaded USACensusSalarybyZip.csv, which has salary and population information by USA zip code. Now let's upload another file that has mapping of zip codes to USA states and cities and then we will join this with the other dataset in SPICE.

Loading data to Redshift

First we need to load this new file to Redshift as a table. Here are the steps to load the data to Redshift:

  1. Download the data file from this GitHub location: https://github.com/rnadipalli/quicksight/blob/master/sampledata/USAZipCodes.csv.
  2. Upload the file to your S3 bucket.
  3. Create a new AWS data pipeline to upload S3 data to your Redshift cluster and call it USAZIPCODES. You can clone the data pipeline created in the Chapter 2, Exploring Any Data, Creating and executing an AWS Data Pipeline section. The following is a screenshot of the new data pipeline:

    Loading data to Redshift

    Figure 3.4: AWS pipeline zip code

  4. For your reference, I have uploaded this data pipeline at https://github.com/rnadipalli/quicksight/blob/master/awsdatapipelines/LoadS3ZipcodetoRedshiftpipeline.json.
  5. Once this data is uploaded to Redshift, you can use a database management tool like SQL Workbench/J to view and query the data.

Loading data to Redshift

Figure 3.5: USA zip codes table description

Note

You can download the SQL Workbench/J software from http://www.sql-workbench.net/downloads.html.

This completes the loading of new data to Redshift; next, we will review the steps to join the two tables and create a new dataset.

Creating a new joined dataset

Now that we have two tables in the Redshift instance, we follow these steps to create a new dataset and load it to SPICE:

  1. From the QuickSight home page, click on Manage data.
  2. Click on New data set and then select the Redshift option.
  3. Next, enter the details of the Redshift data source and select the new table that was created, as shown in the following screenshot:

    Creating a new joined dataset

    Figure 3.6: Preparing data after loading to Redshift

  4. Now, in the data preparation page, select both the usasalarybyzip and usazipcodes tables as shown in the next figure.
  5. Next, click on the ? sign between the two tables to configure the join criteria. You have the following options for join criteria:
    • Inner: This gives only records that have matching values in the join columns from the two tables. This is the most commonly used and default option and is also shown in the next screenshot, where the join field is zip code.
    • Left: This performs a left outer join, returning all records from the left-hand side table and only records that have a value in the join column for the right-hand side table.
    • Right: This performs a right outer join, returning all records from the right-hand side table and only records that have a value in the join column for the left-hand side table.
    • Outer: This performs a full outer join and returns records from both tables regardless of the values in the join columns.
  6. Click on Apply to activate the join and see the results:

    Creating a new joined dataset

    Figure 3.7: Join using SPICE

  7. SPICE then shows the resultant combined dataset, as shown in the following screenshot. Next, click on Prepare data & visualize to visualize the joined dataset.

    Creating a new joined dataset

    Figure 3.8: Prepare and visualize the joined dataset

  8. Now, in QuickSight visualization, you can change the visualization type, dimensions, and measures and see an interesting stacked bar that shows which city has the best average salary in orange bars; the corresponding population is shown in blue bars. The following screenshot shows that Pound Ridge has an average salary of $219,554 and a population of 4,764:

    Creating a new joined dataset

    Figure 3.9: Visualization on joined data

This completes the joining dataset section; next, let's review other ways to enrich data in SPICE.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset