Working with Redshift databases and tables

Before we start querying the Redshift database, we will first need to upload some sample data to it. For this particular scenario, we are going to use a small subset of HTTP request logs that originated from a web server at the NASA Kennedy Space Center in Florida. This data is available for public use and can be downloaded from here: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

The log file essentially contains the following set of columns:

  • Host: The host that is making the web request to the web server. This field contains either a fully qualified hostname or an IP address.
  • Timestamp: The timestamp of the particular web request. In the original logs, the format is DAY MON DD HH:MM:SS YYYY, using a 24-hour clock; in the sample CSV file used here, the timestamp is stored as a Unix epoch value in seconds.
  • Request: The HTTP method used in the request (GET, HEAD, or POST).
  • URL: The URL of the resource that was requested by the client.
  • Response: This contains the HTTP response code (200, 302, 304, and 404).
  • Bytes: The size of the reply in bytes.

Here's a snippet of the data for your reference:

pppa006.compuserve.com,807256800,GET,/images/launch-logo.gif,200,1713 
vcc7.langara.bc.ca,807256804,GET,/shuttle/missions/missions.html,200,8677 
pppa006.compuserve.com,807256806,GET,/history/apollo/images/apollo-logo1.gif,200,1173 

You can download the sample CSV file (2.14 MB containing 30,970 entries) used for this scenario using the following link:
https://github.com/yoyoclouds/Administering-AWS-Volume2.

With the file downloaded, all you need to do is upload it to one of your Amazon S3 buckets. Remember that this bucket must be accessible by Amazon Redshift; otherwise, you may get an S3ServiceException: Access Denied error during execution.

Next, from the SQL Workbench/J client, type in the following code to create a new table within our Redshift database:

create table apachelogs(
  host varchar(100),
  time varchar(20),
  method varchar(8),
  url varchar(200),
  response varchar(10),
  bytes varchar(10));

You can find the complete copy of the previous code at: https://github.com/yoyoclouds/Administering-AWS-Volume2.
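
Note that the table stores the timestamp as epoch seconds in a varchar column. If you later want to work with it as a proper timestamp, one common Redshift idiom is to add the epoch value as an interval to the epoch origin. Here is a sketch of such a query, assuming all time values in the column are numeric:

select host,
  timestamp 'epoch' + cast(time as bigint) * interval '1 second' as request_time
from apachelogs
limit 5;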

Select the Execute Query button. You should receive an output stating that the table is created, as shown in the following screenshot:

Next, use the COPY command to load the contents of the data file stored in Amazon S3 into the newly created Redshift table. The COPY command is very versatile and can be used to load data residing in Amazon S3, Amazon EMR, or even an Amazon DynamoDB table into Amazon Redshift. To learn more about the COPY command, see: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html.

Substitute the values of <REDSHIFT_TABLE_NAME> with the name of the newly created table, the <BUCKET_NAME> with the name of the S3 bucket that contains the data file, and <REDSHIFT_IAM_ROLE_ARN> with the ARN of the IAM read-only access role that we created as a part of Amazon Redshift's prerequisite process:

copy <REDSHIFT_TABLE_NAME> from 's3://<BUCKET_NAME>/data.csv'  
credentials 'aws_iam_role=<REDSHIFT_IAM_ROLE_ARN>'  
csv; 
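
If the COPY command fails, Amazon Redshift records the reason for each rejected row in the stl_load_errors system table. A quick way to inspect the most recent load failures, for example, is:

select filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 5;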

Once the code is pasted into the SQL Workbench/J, click on the Execute Query button. Here is a snapshot of the command execution from SQL Workbench/J:

With the data loaded, you can now run simple queries against the dataset. The following command will list all 30,970 records from the table:

select * from apachelogs;

The following command will list only those records whose response value was 404:

select * from apachelogs where response='404';

The following command will list all the hosts that have requested for the particular resource:

select host from apachelogs where url='/images/NASA-logosmall.gif'; 
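
Beyond simple filters, you can also run aggregations against the table. For example, here is a sketch of a query that lists the ten most frequently requested URLs by hit count:

select url, count(*) as hits
from apachelogs
group by url
order by hits desc
limit 10;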

You can also use the Redshift dashboard to view the performance and runtime of each individual query by first selecting your Redshift cluster name from the Cluster page. Next, select the Queries tab to bring up the list of the most recently executed queries, as shown in the following screenshot:

You can also drill down into an individual query by selecting its query identification number.
