Loading data from Athena to QuickSight

In this section, we will review how to use Athena, which is a new service from Amazon that enables SQL queries on S3 files without the need for any additional infrastructure.

Uploading data to S3

For this demonstration, we will use all of the files for CollegeScorecard from Data.gov as discussed in Chapter 1, A Quick Start to QuickSightBuilding your first analysis under 60 seconds section.

Note

The dataset is available from the following public URL https://catalog.data.gov/dataset/college-scorecard.

Here are the detailed steps to upload a file to an S3 filesystem:

  1. Download the CollegeScorecard_Raw_Data.zip to your local system (laptop) and unzip the file.
  2. To upload the file to AWS S3, login to our account and from the Services menu, select S3.
  3. Select the S3 bucket or create a new S3 bucket. In the following screenshot, I have selected the collegescorecard bucket that I created earlier.
  4. Create a folder CollegeRaw and then subfolders, one for each year 2010, 2011, 2012, and 2013, as shown in the following screenshot:

    Uploading data to S3

    Figure 2.26: Athena S3 upload

  5. To each subfolder, upload the corresponding data file; for example, MERGED2010_PP.csv file goes to year=2010 subfolder and repeat the same for all years.

This completes the data loading to S3 and next we will see how to view this data in Athena.

Creating a table in Athena

Now that the data is loaded into S3, we can query it using Athena with the following steps:

  1. Open the AWS management console for Athena using this link https://console.aws.amazon.com/athena/home OR search for Athena in the AWS services search bar.
  2. Using the Query Editor, run the create database statement, as shown in the following screenshot:

    Creating a table in Athena

    Figure 2.27: Athena create database

  3. The new database, collegestatsathenadb, should appear in the drop-down menu on the left-hand side. Select the new database.
  4. Create a new table for the files in S3 on the collegescorecard raw data with a partition clause. The query is in GitHub at this location: https://github.com/rnadipalli/quicksight/blob/master/sqlscripts/loadtoAthena.sql.

    Creating a table in Athena

    Figure 2.28: Athena create table

  5. After the table is created, verify it by browsing it on the left-hand panel.
  6. To load all partitions of the table, run the following command:
            MSCK REPAIR TABLE CollegeStatsAthenaDB.CollegeStats;
    
  7. You can now query the table with select * from CollegeStatsAthenaDB.CollegeStats limit 5; and view data as shown in the following screenshot:

    Creating a table in Athena

    Figure 2.29: Athena preview data

This completes the creation of a table in an Athena database. Next, we will see how to visualize this data in QuickSight.

Creating a new QuickSight dataset from Athena

Now that data is accessible via Athena, follow these steps to create a new dataset in QuickSight:

  1. From the QuickSight home page, click on Manage data.
  2. Select New data set and then select Athena.
  3. For the Data source name, enter the same name as the Athena database; CollegeStatsAthenaDB.
  4. Click on Validate connection to confirm that QuickSight can connect to Athena. After it is validated, click on Create data source to complete the data source creation, as shown in the following screenshot:

    Creating a new QuickSight dataset from Athena

    Figure 2.30: Athena data source

  5. Select the table collegestats from the table selection and then select Edit/Preview data, as shown in the following screenshot:

    Creating a new QuickSight dataset from Athena

    Figure 2.31: Athena table selection

  6. The table has over 1700 fields; for this demonstration, we will focus only on enrollment-related fields and only for public colleges. For this, we will use the QuickSight option to report data based on custom SQL, as shown in the following screenshot:

    Creating a new QuickSight dataset from Athena

    Figure 2.32: Athena data source import to SPICE

    The query is as follows:

            select Year, UNITID, INSTNM, CITY, STABBR, ZIP,region, 
            ADM_RATE,
            CASE WHEN (SATVRMID='NULL') THEN '0' ELSE SATVRMID END as
              SAT_Midpoint_Reading,
            CASE WHEN (SATMTMID='NULL') THEN '0' ELSE SATMTMID END as
              SAT_Midpoint_Math,
            CASE WHEN (SATWRMID='NULL') THEN '0' ELSE SATWRMID END as
            SAT_Midpoint_Writing,
            CASE WHEN (SAT_AVG='NULL') THEN '0' ELSE SAT_AVG END as
              SAT_Average,
            CASE WHEN (ACTCMMID='NULL') THEN '0' ELSE ACTCMMID END as
              ACT_Midpoint_Cumulative,
            CASE WHEN (ACTENMID='NULL') THEN '0' ELSE ACTENMID END as
              ACT_Midpoint_English,
            CASE WHEN (ACTMTMID='NULL') THEN '0' ELSE ACTMTMID END as
              ACT_Midpoint_Math,
            CASE WHEN (ACTWRMID='NULL') THEN '0' ELSE ACTWRMID END as
              ACT_Midpoint_Writing
            CASE WHEN (NPT4_PUB='NULL') THEN '0' ELSE NPT4_PUB END as
              Average_Net_Tuition_Price,
            CASE WHEN (UGDS='NULL') THEN '0' ELSE UGDS END as
              Enrollment_All_Count,
            CASE WHEN (TUITIONFEE_IN='NULL') THEN '0' ELSE TUITIONFEE_IN
              END as Average_InState_Tuition,
            CASE WHEN (TUITIONFEE_OUT='NULL') THEN '0' ELSE TUITIONFEE_OUT
              END as Average_OutState_Tuition
            from CollegeStatsAthenaDB.CollegeStats
            where main = '1'
            and HIGHDEG = '4'
            and CONTROL = '1';
    
  7. Update the data type of all numeric fields to integer as this will help the reporting and visualization.
  8. Click on Save & visualize to analyze this data.
  9. Change the visualization type to vertical stacked bar chart; set the Enrollment_All_Count to a measure; select X axis as Year, Value as Enrollement_All_Count, and Group/Color as STABBAR. Finally, to focus only on top states by count, filter the chart with Enrollment_All_Count greater than 200,000.
  10. This will give you a pretty useful trend that shows that the state of Ohio has a drop of enrollment from 219K in year 2010 to 217K in year 2013 in comparison to the state of Texas where enrollment grew from 438K to 467K for the same time period, as shown in the following screenshot:

    Creating a new QuickSight dataset from Athena

    Figure 2.33: Athena bar chart

This completes the loading of data from Athena to QuickSight.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset