Chapter 4
Hello, Databases

THE AWS CERTIFIED DEVELOPER – ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:

  • Domain 2: Security
    • 2.1 Make authenticated calls to AWS services.
    • 2.2 Implement encryption using AWS services.
  • Domain 3: Development with AWS Services
    • 3.2 Translate functional requirements into application design.
    • 3.3 Write code that interacts with AWS services by using APIs, SDKs, and AWS CLI.
  • Domain 4: Refactoring
    • 4.1 Optimize application to best use AWS services and features.
  • Domain 5: Monitoring and Troubleshooting
    • 5.2 Perform root cause analysis on faults found in testing or production.

Introduction to Databases

In addition to the storage options discussed in the previous chapters, AWS offers a broad range of databases purpose-built for your specific application use cases. You can also set up your own database platform on Amazon Elastic Compute Cloud (Amazon EC2). You can migrate your existing databases cost-effectively with the AWS Database Migration Service (AWS DMS).

AWS Cloud offerings include the following databases:

  • Managed relational databases—For transactional applications
  • Nonrelational databases—For internet-scale applications
  • Data warehouse databases—For analytics
  • In-memory data store databases—For caching and real-time workloads
  • Time-series databases—For efficiently collecting, synthesizing, and deriving insights from time-series data
  • Ledger databases—For when you need a centralized, trusted authority to maintain a scalable, complete, and cryptographically verifiable record of transactions
  • Graph databases—For building applications with highly connected data

Depending on your use case, you can choose from an AWS database that closely aligns with your needs. Table 4.1 describes each database service that AWS offers and indicates the database type.

Table 4.1 AWS Database Service Mapping to Database Type

Product | Type | Description
Amazon Aurora | Relational database | A MySQL- and PostgreSQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases.
Amazon Relational Database Service (Amazon RDS) | Relational database | A managed relational database for MySQL, PostgreSQL, Oracle, SQL Server, and MariaDB that makes it easy to set up, operate, and scale a relational database in the cloud.
Amazon DynamoDB | NoSQL database | A serverless, managed NoSQL database that delivers consistent single-digit millisecond latency at any scale. Pay only for the throughput and storage you use.
Amazon Redshift | Data warehouse | A fast, fully managed, petabyte-scale data warehouse at one-tenth the cost of traditional solutions. A simple and cost-effective solution to analyze data by using standard SQL and your existing business intelligence (BI) tools.
Amazon ElastiCache | In-memory data store | A managed service to deploy, operate, and scale an in-memory data store based on Memcached or Redis in the cloud.
Amazon Neptune | Graph database | A fast, reliable, fully managed graph database to store and manage highly connected datasets.
Amazon DocumentDB (with MongoDB compatibility) | Nonrelational database | A fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads.
Amazon Timestream | Time-series database | A fast, scalable, fully managed time-series database service for IoT and operational applications that makes it easy to store and analyze trillions of events per day at one-tenth the cost of relational databases.
Amazon Quantum Ledger Database (Amazon QLDB) | Ledger database | A fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority.
AWS Database Migration Service (AWS DMS) | Database migration | A service that helps you migrate your databases to AWS easily and inexpensively with minimal downtime.

Now that you know the purpose of these database services and what they can do, review the types of applications that each AWS database service is suited for. Refer to the application mappings shown in Table 4.2.

Table 4.2 Application Mapping to AWS Database Service

Applications | Product
Transactional applications, such as ERP, CRM, and ecommerce, to log transactions and store structured data | Aurora or Amazon RDS
Internet-scale applications, such as hospitality, dating, and ride sharing, to serve content and store structured and unstructured data | DynamoDB or Amazon DocumentDB
Analytic applications for operational reporting and querying terabyte- to exabyte-scale data | Amazon Redshift
Real-time application use cases that require submillisecond latency, such as gaming leaderboards, chat, messaging, streaming, and Internet of Things (IoT) | ElastiCache
Applications with use cases that require navigation of highly connected data, such as social news feeds, recommendations, and fraud detection | Neptune
Applications that collect data at millions of inserts per second in a time-series fashion, for example clickstream data and IoT devices | Timestream
Applications that require an accurate history of their application data, for example tracking the history of credits and debits in banking transactions or verifying the audit trails created in relational databases | Amazon QLDB

Relational Databases

Many developers have had to interact with relational databases in their applications. This section first describes what a relational database is and then covers how you can run a relational database in the AWS Cloud with Amazon RDS or on Amazon EC2.

A relational database is a collection of data items with predefined relationships between them. These items are organized as a set of tables with columns and rows. Tables store information about the objects to be represented in the database. Each column in a table holds certain data, and a field stores the actual value of an attribute. The rows in the table represent a collection of related values of one object or entity. Each row in a table contains a unique identifier called a primary key, and rows among multiple tables can be linked by using foreign keys. You can access data in many different ways without reorganizing the database tables.

Characteristics of Relational Databases

Relational databases include four important characteristics: Structured Query Language (SQL), data integrity, transactions, and ACID (atomicity, consistency, isolation, and durability) compliance.

Structured Query Language

Structured Query Language (SQL) is the primary interface that you use to communicate with relational databases. The standard American National Standards Institute (ANSI) SQL is supported by all popular relational database engines. Some of these engines have extensions to ANSI SQL to support functionality that is specific to that engine. You use SQL to add, update, or delete data rows; to retrieve subsets of data for transaction processing and analytics applications; and to manage all aspects of the database.

Data Integrity

Data integrity is the overall completeness, accuracy, and consistency of data. Relational databases use a set of constraints to enforce data integrity in the database. These include primary keys, foreign keys, NOT NULL constraints, unique constraints, default constraints, and check constraints. These integrity constraints help enforce business rules in the tables to ensure the accuracy and reliability of the data. In addition, most relational databases enable you to embed custom code triggers that execute based on an action on the database.

Transactions

A database transaction is one or more SQL statements that execute as a sequence of operations to form a single logical unit of work. Transactions provide an all-or-nothing proposition, meaning that the entire transaction must complete as a single unit and be written to the database, or none of the individual components of the transaction will continue. In relational database terminology, a transaction results in a COMMIT or a ROLLBACK. Each transaction is treated in a coherent and reliable way, independent of other transactions.

ACID Compliance

All database transactions must be ACID compliant, that is, atomic, consistent, isolated, and durable, to ensure data integrity.

Atomicity Atomicity requires that the transaction as a whole executes successfully, or if a part of the transaction fails, then the entire transaction is invalid.

Consistency Consistency mandates that the data written to the database as part of the transaction must adhere to all defined rules and restrictions, including constraints, cascades, and triggers.

Isolation Isolation is critical to achieving concurrency control; it ensures that each transaction executes independently of other concurrent transactions.

Durability Durability requires that all of the changes made to the database be permanent when a transaction is successfully completed.

Managed vs. Unmanaged Databases

Managed database services on AWS, such as Amazon RDS, enable you to offload the administrative burdens of operating and scaling distributed databases to AWS so that you don’t have to worry about the following tasks:

  • Hardware provisioning
  • Setup and configuration
  • Throughput capacity planning
  • Replication
  • Software patching
  • Cluster scaling

AWS provides a number of database alternatives for developers. As a managed database, Amazon RDS enables you to run a fully featured relational database while off-loading database administration. By contrast, you can run unmanaged databases on Amazon EC2, which gives you more flexibility on the types of databases that you can deploy and configure; however, you are responsible for the administration of the unmanaged databases.

Amazon Relational Database Service

With Amazon Relational Database Service (Amazon RDS), you can set up, operate, and scale a relational database in the AWS Cloud. It provides cost-efficient, resizable capacity for open-standard relational database engines. Amazon RDS is easy to administer, and you do not need to install the database software. Amazon RDS manages time-consuming database administration tasks, which frees you up to focus on your applications and business. For example, Amazon RDS automatically patches the database software and backs up your database. The Amazon RDS managed relational database service works with the popular database engines depicted in Figure 4.1.


Figure 4.1 Amazon RDS database engines

Amazon RDS assumes many of the difficult or tedious management tasks of a relational database:

Procurement, configuration, and backup tasks

  • When you buy a server, you get a central processing unit (CPU), memory, storage, and input/output operations per second (IOPS) all bundled together. With Amazon RDS, these are split apart so that you can scale them independently and allocate your resources as you need them.
  • Amazon RDS manages backups, software patches, automatic failure detection, and recovery.
  • You can configure automated backups or manually create your own backup snapshot and use these backups to restore a database. The Amazon RDS restore process works reliably and efficiently.
  • You can use familiar database products: MySQL, MariaDB, PostgreSQL, Oracle, Microsoft SQL Server, and the MySQL- and PostgreSQL-compatible Amazon Aurora DB engine.

Security and availability

  • You can enable the encryption option for your Amazon RDS DB instance.
  • You can get high availability with a primary instance and a synchronous secondary instance that you can fail over to when problems occur. You can also use MySQL, MariaDB, or PostgreSQL read replicas to increase read scaling.
  • In addition to the security features in your database package, you can use AWS Identity and Access Management (IAM) to define users and permissions that help control who can access your Amazon RDS databases. You can also help protect your databases by storing them in a virtual private cloud (VPC).
  • To deliver a managed service experience, Amazon RDS does not provide shell access to DB instances, and it restricts access to certain system procedures and tables that require advanced permissions.

When you host databases on Amazon RDS, AWS is responsible for the items in Figure 4.2.


Figure 4.2 Amazon RDS host responsibilities

Relational Database Engines on Amazon RDS

Amazon RDS provides six familiar database engines: Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL, and MariaDB. Because Amazon RDS is a managed service, you gain a number of benefits and features built right into the Amazon RDS service. These features include, but are not limited to, the following:

  • Automatic software patching
  • Easy vertical scaling
  • Easy storage scaling
  • Read replicas
  • Automatic backups
  • Database snapshots
  • Multi-AZ deployments
  • Encryption
  • IAM DB authentication
  • Monitoring and metrics with Amazon CloudWatch

To create an Amazon RDS instance, you can run the following command from the AWS CLI:

aws rds create-db-instance \
    --db-instance-class db.t2.micro \
    --allocated-storage 30 \
    --db-instance-identifier my-cool-rds-db \
    --engine mysql \
    --master-username masteruser \
    --master-user-password masterpassword1!

Depending on the configurations chosen, the database can take several minutes before it is active and ready for use. You can monitor the Amazon RDS Databases console to view the status. When the status states Available, it is ready to be used, as shown in Figure 4.3.


Figure 4.3 Amazon RDS Databases console

Automatic Software Patching

Periodically, Amazon RDS performs maintenance on Amazon RDS resources. Maintenance mostly involves patching the underlying operating system (OS) or the database engine version of the Amazon RDS database. Because this is a managed service, Amazon RDS handles the patching for you.

When you create an Amazon RDS database instance, you can define a maintenance window: a period of time during which any updates are applied and any required downtime occurs for your database instance. You can also enable the automatic minor version upgrade feature, which automatically applies new minor versions of the database engine as they are released (see Figure 4.4).


Figure 4.4 Maintenance window

You can select a maintenance window by using the AWS Management Console, AWS CLI, or Amazon RDS API. After selecting the maintenance window, the Amazon RDS instance is upgraded (if upgrades are available) during that time window. You can also modify the maintenance window by running the following AWS CLI command:

aws rds modify-db-instance --db-instance-identifier your-db-instance-identifier --preferred-maintenance-window Mon:07:00-Mon:07:30

Vertical Scaling

If your database needs to handle a bigger load, you can vertically scale your Amazon RDS instance. At the time of this writing, there are 40 available DB instance classes, which enable you to choose the number of virtual CPUs and memory available. This gives you flexibility over the performance and cost of your Amazon RDS database. To scale the Amazon RDS instance, you can use the console, AWS CLI, or AWS SDK.

If you are in a Single-AZ configuration for your Amazon RDS instance, the database is unavailable during the scaling operation. However, if you are in a Multi-AZ configuration, the standby database is upgraded first and then a failover occurs to the newly configured database. You can also apply the change during the next maintenance window. This way, your upgrade can occur during your normal outage windows.

To scale the Amazon RDS database by using the AWS CLI, run the following command:

aws rds modify-db-instance --db-instance-identifier your-db-instance-identifier --db-instance-class db.t2.medium

Easy Storage Scaling

Storage is a critical component for any database. Amazon RDS has the following three storage types:

General Purpose SSD (gp2) This storage type is for cost-effective storage that is ideal for a broad range of workloads. Gp2 volumes deliver single-digit millisecond latencies and the ability to burst to 3,000 IOPS for extended periods of time. The volume’s size determines the performance of gp2 volumes.

Provisioned IOPS (io1) This storage type is for input/output-intensive workloads that require low input/output (I/O) latency and consistent I/O throughput.

Magnetic Storage This storage type is designed for backward compatibility, and AWS recommends that you use General Purpose SSD or Provisioned IOPS for any new Amazon RDS workloads.

To scale your storage, you must modify the Amazon RDS DB instance by executing the following AWS CLI command:

aws rds modify-db-instance --db-instance-identifier your-db-instance-identifier --allocated-storage 50 --storage-type io1 --iops 3000

This command modifies your storage to 50 GB in size, changes the storage type to Provisioned IOPS (io1), and provisions 3,000 dedicated IOPS. When modifying the Amazon RDS DB instance, consider the potential downtime.

Read Replicas (Horizontal Scaling)

There are two ways to scale your database tier with Amazon RDS: vertical scaling and horizontal scaling. Vertical scaling increases the amount of memory and vCPUs allocated to the primary database. Alternatively, you can scale horizontally by adding another server (a read replica) to your database tier to improve the performance of applications that are read-heavy as opposed to write-heavy.

Read replicas are read-only copies of your master database, which allow you to offload reads (SQL SELECT statements) to the read replica. The replication from the master database to the read replica is asynchronous. As a result, data queried from the read replica might not reflect the latest writes. If your application requires strongly consistent reads, consider an alternative option.

At the time of this writing, Amazon RDS MySQL, PostgreSQL, and MariaDB support up to five read replicas, and Amazon Aurora supports up to 15 read replicas. Microsoft SQL Server and Oracle do not support read replicas.

To create a read replica by using AWS CLI, run the following command:

aws rds create-db-instance-read-replica --db-instance-identifier your-db-instance-identifier --source-db-instance-identifier your-source-db

Backing Up Data with Amazon RDS

Amazon RDS has two different ways of backing up data of your database instance: automated backups and database snapshots (DB snapshots).

Automated Backups (Point-in-Time)

With Amazon RDS, automated backups offer point-in-time recovery of your database. When enabled, Amazon RDS performs a full daily snapshot of your data during your preferred backup window. After the daily snapshot is taken, Amazon RDS then captures transaction logs as changes are made to the database.

After you initiate a point-in-time recovery, to restore your database instance, the transaction logs are applied to the most appropriate daily backup. You can perform a restore up to the specific second, as long as it’s within your retention period. The default retention period is seven days, but it can be a maximum of up to 35 days.

To perform a restore, you choose a time no later than the Latest Restorable Time, which is typically within the last 5 minutes. For example, suppose that the current date is February 14 at 10 p.m., and you would like to do a point-in-time restore to February 14 at 9 p.m. This restore would succeed because the Latest Restorable Time is no earlier than February 14 at 9:55 p.m. (the last 5-minute window). However, a point-in-time restore to February 14 at 9:58 p.m. would fail, because that time is later than the Latest Restorable Time; it falls within the most recent 5-minute window.

Automated backups are kept until the source database is deleted. After the source Amazon RDS instance is removed, the automated backups are also removed.
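For example, you can perform a point-in-time restore from the AWS CLI with a command similar to the following; the instance identifiers and timestamp are placeholders, and a restore always creates a new DB instance rather than overwriting the source:

aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier my-cool-rds-db \
    --target-db-instance-identifier my-cool-rds-db-restored \
    --restore-time 2019-02-14T21:00:00Z

You can substitute the --use-latest-restorable-time flag for --restore-time to restore to the Latest Restorable Time.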

Database Snapshots (Manual)

Unlike automated backups, database snapshots with Amazon RDS are user-initiated and enable you to back up your database instance in a known state at any time. You can also restore to that specific snapshot at any time.

Similar to the other Amazon RDS features, you can create the snapshots through the AWS Management Console, with the CreateDBSnapshot API, or with the AWS CLI.
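For example, a command similar to the following creates a manual snapshot from the AWS CLI (the identifiers are placeholders):

aws rds create-db-snapshot \
    --db-instance-identifier my-cool-rds-db \
    --db-snapshot-identifier my-cool-rds-db-final-snapshot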

With DB snapshots, the backups are kept until you explicitly delete them; therefore, take a final snapshot before removing any Amazon RDS instance. Regardless of the backup type, storage I/O may be briefly suspended while the backup process initializes (typically a few seconds), and you may experience a brief period of elevated latency. You can avoid these suspensions by deploying in a Multi-AZ configuration, because the backup is then taken from the standby instead of the primary database.

To create a snapshot of the database, from the Amazon RDS Databases console, under Actions, select the Take snapshot option (see Figure 4.5). After a snapshot is taken, you can view all of your snapshots from the Snapshots console.


Figure 4.5 Taking an Amazon RDS snapshot

Multi-AZ Deployments

By using Amazon RDS, you can run in a Multi-AZ configuration. In a Multi-AZ configuration, you have a primary and a standby DB instance. Updates to the primary database replicate synchronously to the standby replica in a different Availability Zone. The primary benefit of Multi-AZ is realized during certain types of planned maintenance, or in the unlikely event of a DB instance failure or an Availability Zone failure. Amazon RDS automatically fails over to the standby so that you can resume your workload as soon as the standby is promoted to the primary. This means that you can reduce your downtime in the event of a failure.

Because Amazon RDS is a managed service, Amazon RDS handles the failover to the standby. When there is a DB instance failure, Amazon RDS automatically promotes the standby to the primary; you will not interact with the standby directly. In other words, you receive a single endpoint for the Amazon RDS instance, and Amazon RDS handles the failover behind it.

Amazon RDS Multi-AZ configuration provides the following benefits:

  • Automatic failover; no administration required
  • Increased durability in the unlikely event of component failure
  • Increased availability in the unlikely event of an Availability Zone failure
  • Increased availability during planned maintenance and automated backups (backups are taken from the standby, so I/O activity is not suspended on the primary)

To create an Amazon RDS instance in a Multi-AZ configuration, you must specify a subnet group that has two different Availability Zones specified. You can specify a Multi-AZ configuration by using AWS CLI by adding the --multi-az flag to the AWS CLI command, as follows:

aws rds create-db-instance \
    --db-instance-class db.t2.micro \
    --allocated-storage 30 \
    --db-instance-identifier multi-az-rds-db \
    --engine mysql \
    --master-username masteruser \
    --master-user-password masterpassword1! \
    --multi-az

Encryption

For encryption at rest, Amazon RDS uses the AWS Key Management Service (AWS KMS) for AES-256 encryption. You can use a default master key or specify your own for the Amazon RDS DB instance. Encryption is one of the few options that must be configured when the DB instance is created. You cannot modify an Amazon RDS database to enable encryption. You can, however, create a DB snapshot and then restore to an encrypted DB instance or cluster.
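For example, the following sketch creates an encrypted DB instance from the AWS CLI by adding the --storage-encrypted flag; the optional --kms-key-id parameter (shown here with a placeholder key alias) specifies your own key instead of the default master key. Note that the DB instance class must be one that supports encryption at rest (db.t2.micro, for example, does not):

aws rds create-db-instance \
    --db-instance-class db.t2.small \
    --allocated-storage 30 \
    --db-instance-identifier my-encrypted-rds-db \
    --engine mysql \
    --master-username masteruser \
    --master-user-password masterpassword1! \
    --storage-encrypted \
    --kms-key-id alias/my-rds-key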

Amazon RDS supports Transparent Data Encryption (TDE) for Oracle and SQL Server. For more information on TDE with Oracle and Microsoft SQL Server, see the Amazon RDS documentation for those engines.

At the time of this writing, the following Amazon RDS DB instance types are not supported for encryption at rest:

  • db.m1.small
  • db.m1.medium
  • db.m1.large
  • db.m1.xlarge
  • db.m2.xlarge
  • db.m2.2xlarge
  • db.m2.4xlarge
  • db.t2.micro

For encryption in transit, Amazon RDS generates an SSL certificate for each database instance that can be used to connect your application and the Amazon RDS instance. However, encryption is a compute-intensive operation that increases the latency of your database connection. For more information, see the documentation for the specific database engine.
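As a rough sketch, after downloading the Amazon RDS certificate bundle, you might connect a MySQL client over SSL with a command similar to the following; the endpoint and certificate file path are placeholders, and the exact SSL options depend on your client version:

mysql -h my-cool-rds-db.abc123xyz.us-east-1.rds.amazonaws.com \
    -u masteruser -p \
    --ssl-ca=rds-combined-ca-bundle.pem \
    --ssl-mode=VERIFY_IDENTITY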

IAM DB Authentication

You can authenticate to your DB instance by using IAM. By using IAM, you can manage access to your database resources centrally instead of storing the user credentials in each database. The IAM feature also encrypts network traffic to and from the database by using SSL.

IAM DB authentication is supported only for MySQL and PostgreSQL. At the time of this writing, the following MySQL versions are supported:

  • MySQL 5.6.34 or later
  • MySQL 5.7.16 or later

    There’s no support for the following:

  • IAM DB Authentication for MySQL 5.5 or MySQL 8.0
  • db.t2.small and db.m1.small instances

    The following PostgreSQL versions are supported:

  • PostgreSQL versions 10.6 or later
  • PostgreSQL 9.6.11 or later
  • PostgreSQL 9.5.15 or later

To enable IAM DB authentication for your Amazon RDS instance, run the following command:

aws rds modify-db-instance --db-instance-identifier my-rds-db --enable-iam-database-authentication --apply-immediately

Because downtime is associated with this action, you can enable this feature during the next maintenance window. You can do so by changing the last parameter to --no-apply-immediately.
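Once IAM DB authentication is enabled and a matching IAM policy and database user are in place, you can generate a temporary authentication token to use in place of a password. The following sketch uses placeholder values for the endpoint and database user:

aws rds generate-db-auth-token \
    --hostname my-rds-db.abc123xyz.us-east-1.rds.amazonaws.com \
    --port 3306 \
    --username my_iam_db_user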

Monitoring with Amazon CloudWatch

Use Amazon CloudWatch to monitor your database tier. You can create alarms to notify database administrators when there is a failure.

By default, CloudWatch provides some built-in metrics for Amazon RDS with a granularity of 5 minutes (300 seconds). If you want to gather metrics at a finer granularity, down to 1 second, enable Enhanced Monitoring, which is similar to how you enable detailed monitoring features in Amazon EC2.
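For example, you can retrieve one of the built-in Amazon RDS metrics, such as CPUUtilization, from the AWS CLI with a command along these lines; the instance identifier and time range are placeholders:

aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name CPUUtilization \
    --dimensions Name=DBInstanceIdentifier,Value=my-cool-rds-db \
    --statistics Average \
    --period 300 \
    --start-time 2019-02-14T00:00:00Z \
    --end-time 2019-02-14T06:00:00Z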

To view all the Amazon RDS metrics that are provided through CloudWatch, select the Monitoring tab from the Amazon RDS console (see Figure 4.6).


Figure 4.6 Amazon RDS with CloudWatch metrics

Amazon RDS integrates with CloudWatch to send it the following database logs:

  • Audit log
  • Error log
  • General log
  • Slow query log

From the Amazon RDS console, select the Logs & events tab to view and download the specified logs, as shown in Figure 4.7.


Figure 4.7 Amazon RDS with CloudWatch Logs
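You can also list and download these logs from the AWS CLI. The following sketch uses a placeholder instance identifier and a typical MySQL error log name; the actual log file names for your instance are returned by the first command:

aws rds describe-db-log-files \
    --db-instance-identifier my-cool-rds-db

aws rds download-db-log-file-portion \
    --db-instance-identifier my-cool-rds-db \
    --log-file-name error/mysql-error.log \
    --output text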

For more information on CloudWatch and its capabilities across other AWS services, see Chapter 15, “Monitoring and Troubleshooting.”

Amazon Aurora

Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.

Aurora is part of the managed database service Amazon RDS.

Amazon Aurora DB Clusters

Aurora is a drop-in replacement for MySQL and PostgreSQL relational databases. It is built for the cloud, and it combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. You can use the code, tools, and applications that you use today with your existing MySQL and PostgreSQL databases with Aurora.

The integration of Aurora with Amazon RDS means that time-consuming administration tasks, such as hardware provisioning, database setup, patching, and backups, are automated.

Aurora features a distributed, fault-tolerant, self-healing storage system that automatically scales up to 64 TiB per database instance. (In comparison, other Amazon RDS options allow for a maximum of 32 TiB.) Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon Simple Storage Service (Amazon S3), and replication across three Availability Zones. When you create an Aurora instance, you create a DB cluster. A DB cluster consists of one or more DB instances and a cluster volume that manages the data for those instances. An Aurora cluster volume is a virtual database storage volume that spans multiple Availability Zones, and each Availability Zone has a copy of the DB cluster data.

An Aurora DB cluster has two types of DB instances:

Primary Instance Supports read and write operations and performs all of the data modifications to the cluster volume. Each Aurora DB cluster has one primary instance.

Amazon Aurora Replica Supports read-only operations. Each Aurora DB cluster can have up to 15 Amazon Aurora Replicas in addition to the primary instance. Multiple Aurora Replicas distribute the read workload, and if you locate Aurora Replicas in separate Availability Zones, you can also increase database availability.

Figure 4.8 illustrates the relationship between the cluster volume, the primary instance, and Aurora Replicas in an Aurora DB cluster.


Figure 4.8 Amazon Aurora DB cluster

As you can see from Figure 4.8, this architecture is vastly different from the other Amazon RDS databases. Aurora is engineered and architected for the cloud. The primary difference is that there is a separate storage layer, called the cluster volume, which is spread across multiple Availability Zones in a single AWS Region. This means that the durability of your data is increased.

Additionally, Aurora has one primary instance that writes across the cluster volume. This means that Aurora replicas can be spun up quickly, because they don’t have to copy and store their own storage layer; they connect to it. Because the cluster volume is separated in this architecture, the cluster volume can grow automatically as your data increases. This is in contrast to how other Amazon RDS databases are built, whereby you must define the allocated storage in advance.
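To experiment with this architecture, you can create an Aurora DB cluster and then add a DB instance to it from the AWS CLI. The following is a minimal sketch with placeholder identifiers; the engine name and instance class you choose depend on the Aurora edition you are using:

aws rds create-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --engine aurora-mysql \
    --master-username masteruser \
    --master-user-password masterpassword1!

aws rds create-db-instance \
    --db-instance-identifier my-aurora-instance-1 \
    --db-cluster-identifier my-aurora-cluster \
    --engine aurora-mysql \
    --db-instance-class db.r5.large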

Amazon Aurora Global Databases

With Aurora, you can also create a multiregional deployment for your database tier. In this configuration, the primary AWS Region is where your data is written (you may also do reads from the primary AWS Region). Any application performing writes must write to the primary AWS Region where the cluster is operating.

The secondary AWS Region is used for reading data only. Aurora replicates the data to the secondary AWS Region with typical latency of less than a second. Furthermore, you can use the secondary AWS Region for disaster recovery purposes. You can promote the secondary cluster and make it available as the primary typically in less than a minute. At the time of this writing, Aurora global databases are available in the following AWS Regions only:

  • US East (N. Virginia)
  • US East (Ohio)
  • US West (Oregon)
  • EU (Ireland)

Additionally, at the time of this writing, Aurora global databases are available only for MySQL 5.6.

Amazon Aurora Serverless

Aurora Serverless is an on-demand, automatic scaling configuration for Aurora. (It is available only for MySQL at the time of this writing.) With Aurora Serverless, the database will automatically start up, shut down, and scale capacity up or down based on your application’s needs. This means that, as a developer, you can run your database in the AWS Cloud and not worry about managing any database instances.
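As a sketch, you could create an Aurora Serverless cluster from the AWS CLI by setting the engine mode to serverless and supplying a scaling configuration; the cluster identifier and capacity values here are placeholders:

aws rds create-db-cluster \
    --db-cluster-identifier my-serverless-cluster \
    --engine aurora \
    --engine-mode serverless \
    --master-username masteruser \
    --master-user-password masterpassword1! \
    --scaling-configuration MinCapacity=2,MaxCapacity=16,AutoPause=true,SecondsUntilAutoPause=300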

Best Practices for Running Databases on AWS

The following are best practices for working with Amazon RDS:

Follow Amazon RDS basic operational guidelines. The Amazon RDS Service Level Agreement requires that you follow these guidelines:

  • Monitor your memory, CPU, and storage usage. Amazon CloudWatch can notify you when usage patterns change or when you approach the capacity of your deployment so that you can maintain system performance and availability.
  • Scale up your DB instance when you approach storage capacity limits. Have some buffer in storage and memory to accommodate unforeseen increases in demand from your applications.
  • Enable automatic backups, and set the backup window to occur during the daily low in write IOPS.
  • If your database workload requires more I/O than you have provisioned, recovery after a failover or database failure will be slow. To increase the I/O capacity of a DB instance, do any or all of the following:
    • Migrate to a DB instance class with high I/O capacity.
    • Convert from standard storage either to General Purpose or Provisioned IOPS storage, depending on how much of an increase you need. If you convert to Provisioned IOPS storage, make sure that you also use a DB instance class that is optimized for Provisioned IOPS.
    • If you are already using Provisioned IOPS storage, provision additional throughput capacity.
  • If your client application is caching the Domain Name System (DNS) data of your DB instances, set a time-to-live (TTL) value of less than 30 seconds. Because the underlying IP address of a DB instance can change after a failover, caching the DNS data for an extended time can lead to connection failures if your application tries to connect to an IP address that is no longer in service.
  • Test failover for your DB instance to understand how long the process takes for your use case and to ensure that the application that accesses your DB instance can automatically connect to the new DB instance after failover.
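One way to test failover for a Multi-AZ DB instance is to reboot it with a forced failover, for example with a command similar to the following (placeholder identifier):

aws rds reboot-db-instance \
    --db-instance-identifier my-cool-rds-db \
    --force-failover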

Allocate sufficient RAM to the DB instance. An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. Check the ReadIOPS metric by using CloudWatch while the DB instance is under load to view the working set. The value of ReadIOPS should be small and stable. Scale up the DB instance class until ReadIOPS no longer drops dramatically after a scaling operation or when ReadIOPS is reduced to a small amount.

Implement Amazon RDS security. Use IAM accounts to control access to Amazon RDS API actions, especially actions that create, modify, or delete Amazon RDS resources, such as DB instances, security groups, option groups, or parameter groups, and actions that perform common administrative actions, such as backing up and restoring DB instances, or configuring Provisioned IOPS storage.

  • Assign an individual IAM user to each person who manages Amazon RDS resources. Do not use your AWS account root user to manage Amazon RDS resources; create an IAM user for everyone, including yourself.
  • Grant each user the minimum set of permissions required to perform his or her duties.
  • Use IAM groups to manage permissions effectively for multiple users.
  • Rotate your IAM credentials regularly.

Use the AWS Management Console, the AWS CLI, or the Amazon RDS API to change the password for your master user. If you use another tool, such as a SQL client, to change the master user password, it might result in permissions being revoked for the user unintentionally.

Use enhanced monitoring to identify OS issues. Amazon RDS provides metrics in real time for the OS on which your DB instance runs. You can view the metrics for your DB instance by using the console or consume the Enhanced Monitoring JSON output from CloudWatch Logs in a monitoring system of your choice. Enhanced Monitoring is available for the following database engines:

  • MariaDB
  • Microsoft SQL Server
  • MySQL version 5.5 or later
  • Oracle
  • PostgreSQL

Enhanced Monitoring is available for all DB instance classes except for db.m1.small. Enhanced Monitoring is available in all regions except for AWS GovCloud (US).

Use metrics to identify performance issues. To identify performance issues caused by insufficient resources and other common bottlenecks, you can monitor the metrics available for your Amazon RDS DB instance.

Monitor performance metrics regularly to see the average, maximum, and minimum values for a variety of time ranges. If you do so, you can identify when performance is degraded. You can also set CloudWatch alarms for particular metric thresholds.

To troubleshoot performance issues, it’s important to understand the baseline performance of the system. When you set up a new DB instance and get it running with a typical workload, you should capture the average, maximum, and minimum values of all the performance metrics at a number of different intervals (for example, 1 hour, 24 hours, 1 week, or 2 weeks) to get an idea of what is normal. It helps to get comparisons for both peak and off-peak hours of operation. You can then use this information to identify when performance is dropping below standard levels.

Tune queries. One of the best ways to improve DB instance performance is to tune your most commonly used and most resource-intensive queries to make them less expensive to run.

A common aspect of query tuning is creating effective indexes. You can use the Database Engine Tuning Advisor to get potential index improvements for your DB instance.

Use DB parameter groups. AWS recommends that you apply changes to the DB parameter group on a test DB instance before you apply parameter group changes to your production DB instances. Improperly setting DB engine parameters in a DB parameter group can have unintended adverse effects, including degraded performance and system instability. Always exercise caution when modifying DB engine parameters, and back up your DB instance before modifying a DB parameter group.
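For example, you might create a custom DB parameter group, adjust a parameter on it, and attach it to a test DB instance first. This sketch uses placeholder names and a MySQL 5.7 parameter group family:

aws rds create-db-parameter-group \
    --db-parameter-group-name my-test-mysql-params \
    --db-parameter-group-family mysql5.7 \
    --description "Test parameter group"

aws rds modify-db-parameter-group \
    --db-parameter-group-name my-test-mysql-params \
    --parameters "ParameterName=max_connections,ParameterValue=250,ApplyMethod=pending-reboot"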

Use read replicas. Use read replicas to relieve pressure on your master node with additional read capacity. You can bring your data closer to applications in different regions and promote a read replica to a master for faster recovery in the event of a disaster.

Tip: You can use the AWS Database Migration Service (AWS DMS) to migrate or replicate your existing databases easily to Amazon RDS.

Nonrelational Databases

Nonrelational databases are commonly used for internet-scale applications that do not require any complex queries.

NoSQL Database

NoSQL databases are nonrelational databases optimized for scalable performance and schema-less data models. NoSQL databases are also widely recognized for their ease of development, low latency, and resilience.

NoSQL database systems use a variety of models for data management, such as in-memory key-value stores, graph data models, and document stores. These types of databases are optimized for applications that require large data volume, low latency, and flexible data models, which are achieved by relaxing some of the data consistency restrictions of traditional relational databases.

When to Use a NoSQL Database

NoSQL databases are a great fit for many big data, mobile, and web applications that require greater scale and higher responsiveness than traditional relational databases. Because of simpler data structures and horizontal scaling, NoSQL databases typically respond faster and are easier to scale than relational databases.

Comparison of SQL and NoSQL Databases

Many developers are familiar with SQL databases but might be new to working with NoSQL databases. Relational database management systems (RDBMS) and nonrelational (NoSQL) databases have different strengths and weaknesses. In an RDBMS, data can be queried flexibly, but queries are relatively expensive and do not scale well in high-traffic situations. In a NoSQL database, you can query data efficiently in a limited number of ways. Table 4.3 shows a comparison of different characteristics of SQL and NoSQL databases.

Table 4.3 SQL vs. NoSQL Database Characteristics

Characteristic | SQL | NoSQL
Data Storage | Rows and columns | Key-value, document, wide-column, graph
Schemas | Fixed | Dynamic
Querying | Using SQL | Focused on collections of documents
Scalability | Vertical | Horizontal
Transactions | Supported | Support varies
Consistency | Strong | Eventual and strong

The storage format for SQL versus NoSQL databases also differs. As shown in Figure 4.9, SQL databases are often stored in a row and column format, whereas NoSQL databases, such as Amazon DynamoDB, have a key-value format that could be in a JSON format, as shown in this example.


Figure 4.9 SQL versus NoSQL format comparison

NoSQL Database Types

There are four types of NoSQL databases: columnar, document, graph, and in-memory key-value. Generally, these databases differ in how the data is stored, accessed, and structured, and they are optimized for different use cases and applications.

Columnar databases Columnar databases are optimized for reading and writing columns of data as opposed to rows of data. Column-oriented storage for database tables is an important factor in analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data that you must load from disk.

Document databases Document databases are designed to store semi-structured data as documents, typically in JSON or XML format. Unlike traditional relational databases, the schema for each NoSQL document can vary, giving you more flexibility in organizing and storing application data and reducing storage required for optional values.

Graph databases Graph databases store vertices and directed links called edges. Graph databases can be built on both SQL and NoSQL databases. Vertices and edges can each have properties associated with them.

In-memory key-value stores In-memory key-value stores are NoSQL databases optimized for read-heavy application workloads (such as social networking, gaming, media sharing, and Q&A portals) or compute-intensive workloads (such as a recommendation engine). In-memory caching improves application performance by storing critical pieces of data in memory for low-latency access.

Amazon DynamoDB

Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed cloud database, and it supports both document and key-value store models. Its flexible data model, reliable performance, and automatic scaling of throughput capacity make it a great fit for the following:

  • Mobile
  • Gaming
  • Adtech
  • Internet of Things (IoT)
  • Applications that do not require complex queries

With DynamoDB, you can create database tables that can store and retrieve any amount of data and serve any level of request traffic. You can scale up or scale down your table throughput capacity without downtime or performance degradation. DynamoDB automatically spreads the data and traffic for your tables over a sufficient number of servers to handle your throughput and storage requirements while maintaining consistent and fast performance. All of your data is stored on solid-state drives (SSDs) and automatically replicated across multiple Availability Zones in an AWS Region, providing built-in high availability and data durability. You can use global tables to keep DynamoDB tables in sync across AWS Regions.

Core Components of Amazon DynamoDB

In DynamoDB, tables, items, and attributes are the core components with which you work. A table is a collection of items, and each item is a collection of attributes. DynamoDB uses primary keys to uniquely identify each item in a table, and secondary indexes can be used to provide more querying flexibility. You can use DynamoDB Streams to capture data modification events in DynamoDB tables.

Figure 4.10 shows the DynamoDB data model, including a table, items, attributes, a required partition key, an optional sort key, and an example of data being stored in partitions.


Figure 4.10 Amazon DynamoDB tables and partitions

Tables

Similar to other database systems, DynamoDB stores data in tables. A table is a collection of items. For example, a table called People could be used to store personal contact information about friends, family, or anyone else of interest.

Items

An item in DynamoDB is similar in many ways to rows, records, or tuples in other database systems. Each DynamoDB table contains zero or more items. An item is a collection of attributes that is uniquely identifiable for each record in that table. For a People table, each item represents a person. There is no limit to the number of items that you can store in a table.

Attributes

Each item is composed of one or more attributes. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems. An attribute is a fundamental data element, something that does not need to be broken down any further. You can think of an attribute as similar to columns in a relational database. For example, an item in a People table contains attributes called PersonID, Last Name, First Name, and so on.

Figure 4.11 shows a table named People with items and attributes. Each block represents an item, and within those blocks you have attributes that define the overall item:

  • Each item in the table has a unique identifier, a primary key, or a partition key that distinguishes the item from all of the others in the table. The primary key consists of one attribute (PersonID).
  • Other than the primary key, the People table is schemaless, which means that you do not have to define the attributes or their data types beforehand. Each item can have its own distinct attributes. This is where the contrast begins to show between NoSQL and SQL. In SQL, you would have to define a schema for each person, and every person would need to have the same data points or attributes. As you can see in Figure 4.11, with NoSQL and DynamoDB, each person can have different attributes.
  • Most of the attributes are scalar, so they can have only one value. Strings and numbers are common examples of scalars.
  • Some of the items have a nested attribute (Address). DynamoDB supports nested attributes up to 32 levels deep.


Figure 4.11 Amazon DynamoDB table with items and attributes

Primary Key

When you create a table, at a minimum, you are required to specify the table name and primary key of the table. The primary key uniquely identifies each item in the table. No two items can have the same key within a table.

DynamoDB supports two different kinds of primary keys: a partition key (a simple primary key) and a partition key combined with a sort key (a composite primary key).

Partition key (hash key) A simple primary key, composed of one attribute, is known as the partition key. DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item is stored.

In a table that has only a partition key, no two items can have the same partition key value. For example, in the People table, with a simple primary key of PersonID, you cannot have two items with PersonID of 000-07-1075.

The partition key of an item is also known as its hash attribute. The term hash attribute derives from the use of an internal hash function in DynamoDB that evenly distributes data items across partitions based on their partition key values.

Each primary key attribute must be a scalar (meaning that it can hold only a single value). The only data types allowed for primary key attributes are string, number, or binary. There are no such restrictions for other, nonkey attributes.
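For example, you could create the People table with a simple primary key (the PersonID partition key) from the AWS CLI as follows; the provisioned throughput values are placeholders:

aws dynamodb create-table \
    --table-name People \
    --attribute-definitions AttributeName=PersonID,AttributeType=S \
    --key-schema AttributeName=PersonID,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5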

Partition key and sort key (range attribute) A composite primary key is composed of two attributes: partition key and the sort key.

The sort key of an item is also known as its range attribute. The term range attribute derives from the way that DynamoDB stores items with the same partition key physically close together, in sorted order, by the sort key value.

In a composite primary key, the partition key works as described previously, but in addition, items with the same partition key are stored together, in sorted order by sort key value.

In a table that has a partition key and a sort key, it’s possible for two items to have the same partition key value, but those two items must have different sort key values. You cannot have two items in the table that have identical partition key and sort key values.

For example, if you have a Music table with a composite primary key (Artist and SongTitle), you can access any item in the Music table directly if you provide the Artist and SongTitle values for that item.

A composite primary key gives you additional flexibility when querying data. For example, if you provide only the value for Artist, DynamoDB retrieves all of the songs by that artist. To retrieve only a subset of songs by a particular artist, you can provide a value for Artist with a range of values for SongTitle.
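For example, a query along these lines retrieves a particular artist's songs whose titles begin with a given prefix; the artist name and prefix shown are placeholders:

aws dynamodb query \
    --table-name Music \
    --key-condition-expression "Artist = :artist AND begins_with(SongTitle, :prefix)" \
    --expression-attribute-values '{":artist": {"S": "Acme Band"}, ":prefix": {"S": "Happy"}}'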

As a developer, the partition key you choose has important implications: if there is little differentiation among partition key values, your data is concentrated on a small number of partitions rather than being spread evenly across them.

Figure 4.12 shows an example of these two types of keys. In the SensorLocation table, the primary key is the SensorId attribute. This means that every item (or row) in this table has a unique SensorId, meaning that each sensor has exactly one location or latitude and longitude value.


Figure 4.12 Amazon DynamoDB primary keys

Conversely, the SensorReadings table has a partition key and a sort key. The SensorId attribute is the partition key and the Time attribute is the sort key, which combined make it a composite key. For each SensorId, there may be multiple items corresponding to sensor readings at different times. The combination of SensorId and Time uniquely identifies items in the table. This design enables you to query the table for all readings related to a particular sensor.

Secondary Indexes

If you want to perform queries on attributes that are not part of the table’s primary key, you can create a secondary index. By using a secondary index, you can query the data in the table by using an alternate key, in addition to querying against the primary key. DynamoDB does not require that you use indexes, but doing so may give you more flexibility when querying your data depending on your application and table design.

After you create a secondary index on a table, you can then read data from the index in much the same way as you do from the table. DynamoDB automatically creates indexes based on the primary key of a table and automatically updates all indexes whenever a table changes.

A secondary index contains the following:

  • Primary key attributes
  • Alternate key attributes
  • (Optional) A subset of other attributes from the base table (projected attributes)

DynamoDB provides fast access to the items in a table by specifying primary key values. However, many applications might benefit from having one or more secondary (or alternate) keys available. This allows efficient access to data with attributes other than the primary key.

DynamoDB supports two types of secondary indexes: local secondary indexes and global secondary indexes. You can define up to five global secondary indexes and five local secondary indexes per table.

Local Secondary Index

A local secondary index is an index that has the same partition key as the base table, but a different sort key (see Figure 4.13). It is “local” in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value. You can create local secondary indexes only when you create the table; you cannot add, remove, or modify them later.


Figure 4.13 Local secondary index

Global Secondary Index

A global secondary index is an index with a partition key and a sort key that can be different from those on the base table (see Figure 4.14). It is considered “global” because queries on the index can span all of the data in the base table, across all partitions. You can create global secondary indexes when you create the table, and you can also add, remove, or modify them later.


Figure 4.14 Global secondary index

Tip: You can create a global secondary index after table creation, but not a local secondary index.

For example, by using a Music table, you can query data items by Artist (partition key) or by Artist and SongTitle (partition key and sort key). Suppose that you also wanted to query the data by Genre and AlbumTitle. To do this, you could create a global secondary index on Genre and AlbumTitle and then query the index in much the same way as you’d query the Music table.

Figure 4.15 shows the example Music table with a new index called GenreAlbumTitle. In the index, Genre is the partition key, and AlbumTitle is the sort key.


Figure 4.15 Amazon DynamoDB table and secondary index

Note the following about the GenreAlbumTitle index:

  • Every index belongs to a table, which is called the base table for the index. In the preceding example, Music is the base table for the GenreAlbumTitle index.
  • DynamoDB maintains indexes automatically. When you add, update, or delete an item in the base table, DynamoDB adds, updates, or deletes the corresponding item in any indexes that belong to that table.
  • When you create an index, you specify which attributes will be copied, or projected, from the base table to the index. At a minimum, DynamoDB projects the key attributes from the base table into the index. This is the case with GenreAlbumTitle, wherein only the key attributes from the Music table are projected into the index.

You can query the GenreAlbumTitle index to find all albums of a particular genre (for example, all Hard Rock albums). You can also query the index to find all albums within a particular genre that have certain album titles (for example, all Heavy Metal albums with titles that start with the letter M).
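Because GenreAlbumTitle is a global secondary index, you could add it to an existing Music table and then query it from the AWS CLI with commands similar to the following sketch; the projection type and throughput values are placeholders:

aws dynamodb update-table \
    --table-name Music \
    --attribute-definitions AttributeName=Genre,AttributeType=S AttributeName=AlbumTitle,AttributeType=S \
    --global-secondary-index-updates '[{"Create": {"IndexName": "GenreAlbumTitle", "KeySchema": [{"AttributeName": "Genre", "KeyType": "HASH"}, {"AttributeName": "AlbumTitle", "KeyType": "RANGE"}], "Projection": {"ProjectionType": "KEYS_ONLY"}, "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}}}]'

aws dynamodb query \
    --table-name Music \
    --index-name GenreAlbumTitle \
    --key-condition-expression "Genre = :g" \
    --expression-attribute-values '{":g": {"S": "Hard Rock"}}'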

Comparison of Local Secondary Indexes and Global Secondary Indexes

To determine which type of index to use, consider your application’s requirements. Table 4.4 shows the main differences between a global secondary index and a local secondary index.

Table 4.4 Comparison of Local and Global Secondary Indexes

Characteristic | Global Secondary Index | Local Secondary Index
Query Scope | Entire table, across all partitions. | Single partition, as specified by the partition key value in the query.
Key Attributes | Partition key, or partition and sort key; can be any scalar attribute in the table. | Partition and sort key; the partition key of the index must be the same attribute as the base table.
Projected Attributes | Only projected attributes can be queried. | Can query attributes that are not projected; attributes are retrieved from the base table.
Read Consistency | Eventual consistency only. | Eventual consistency or strong consistency.
Provisioned Throughput | Separate throughput settings from the base table; consumes separate capacity units. | Same throughput settings as the base table; consumes base table capacity units.
Lifecycle Considerations | Can be created or deleted at any time. | Must be created when the table is created; can be deleted only when the table is deleted.

Amazon DynamoDB Streams

Amazon DynamoDB Streams is an optional feature that captures data modification events in DynamoDB tables. The data about these events appears in the stream in near real time and in the order that the events occurred.

Each event is represented by a stream record. If you enable a stream on a table, DynamoDB Streams writes a stream record whenever one of the following events occurs:

  • A new item is added to the table—The stream captures an image of the entire item, including all of its attributes.
  • An item is updated—The stream captures the “before” and “after” images of any attributes that were modified in the item.
  • An item is deleted from the table—The stream captures an image of the entire item before it was deleted.

Each stream record also contains the name of the table, the event timestamp, and other metadata. Stream records have a lifetime of 24 hours; after that, they are automatically removed from the stream.

Figure 4.16 shows how you can use DynamoDB Streams together with AWS Lambda to create a trigger—code that executes automatically whenever an event of interest appears in a stream. For example, consider a Customers table that contains customer information for a company. Suppose that you want to send a “welcome” email to each new customer. You could enable a stream on that table and then associate the stream with a Lambda function. The Lambda function would execute whenever a new stream record appears, but only process new items added to the Customers table. For any item that has an EmailAddress attribute, the Lambda function could invoke Amazon Simple Email Service (Amazon SES) to send an email to that address.

Figure 4.16 Example of Amazon DynamoDB Streams and AWS Lambda

Symbol of Note In the example shown in Figure 4.16, the last customer, Craig Roe, will not receive an email because he does not have an EmailAddress.
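
The following is a minimal sketch of what such a trigger function could look like in Python; the sender address and message contents are placeholders, and error handling is omitted for brevity:

import boto3

ses = boto3.client('ses', region_name='us-east-1')

def lambda_handler(event, context):
    # Process each stream record delivered to the function
    for record in event['Records']:
        # Only act on newly inserted items
        if record['eventName'] != 'INSERT':
            continue
        new_image = record['dynamodb'].get('NewImage', {})
        # Skip customers that do not have an EmailAddress attribute
        if 'EmailAddress' not in new_image:
            continue
        ses.send_email(
            Source='welcome@example.com',                              # placeholder sender
            Destination={'ToAddresses': [new_image['EmailAddress']['S']]},
            Message={
                'Subject': {'Data': 'Welcome!'},
                'Body': {'Text': {'Data': 'Thank you for signing up.'}}
            }
        )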

In addition to triggers, DynamoDB Streams enables other powerful solutions that developers can create, such as the following:

  • Data replication within and across AWS regions
  • Materialized views of data in DynamoDB tables
  • Data analysis by using Amazon Kinesis materialized views

Read Consistency

DynamoDB replicates data among multiple Availability Zones in a region. When your application writes data to a DynamoDB table and receives an HTTP 200 response (OK), the write has occurred and is durable. The data is eventually consistent across all storage locations, usually within 1 second or less. DynamoDB supports both eventually consistent and strongly consistent reads.

Eventually Consistent Reads

When you read data from a DynamoDB table immediately after a write operation, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data. DynamoDB uses eventually consistent reads, unless you specify otherwise.

Strongly Consistent Reads

When querying data, you can specify whether DynamoDB should return strongly consistent reads. When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting updates from all prior write operations that were successful. A strongly consistent read might not be available if there is a network delay or outage.
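
For example, with the Python SDK you can request a strongly consistent read by setting the ConsistentRead parameter. The following sketch assumes the Music table with Artist and SongTitle as its primary key; the item values are placeholders:

import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('Music')

# Request a strongly consistent read for a single item
response = table.get_item(
    Key={'Artist': 'Sam Samuel', 'SongTitle': 'Some Song Title'},
    ConsistentRead=True
)
print(response.get('Item'))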

Comparison of Consistent Reads

As a developer, it is important to understand the needs of your application. In some applications, eventually consistent reads might be fine, such as a high-score dashboard. In other applications or parts of an application, however, such as a financial or medical system, an eventually consistent read could be an issue. You will want to evaluate your data usage patterns to ensure that you are choosing the right type of reads for each part of your application.

There is an additional cost for strongly consistent reads, and they will have more latency in returning data than an eventually consistent read. So, that cost and timing should also play into your decision.

Read and Write Throughput

When you create a table or index in DynamoDB, you must specify your capacity requirements for read and write activity. By defining your throughput capacity in advance, DynamoDB can reserve the necessary resources to meet the read and write activity your application requires, while ensuring consistent, low-latency performance. Specify your required throughput value by setting the ProvisionedThroughput parameter when you create or update a table.

You specify throughput capacity in terms of read capacity units and write capacity units:

  • One read capacity unit (RCU) represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size. If you need to read an item that is larger than 4 KB, DynamoDB will need to consume additional read capacity units. The total number of read capacity units required depends on the item size and whether you want an eventually consistent or strongly consistent read.
  • One write capacity unit (WCU) represents one write per second for an item up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB must consume additional write capacity units. The total number of write capacity units required depends on the item size.

For example, suppose that you create a table with five read capacity units and five write capacity units. With these settings, your application could do the following:

  • Perform strongly consistent reads of up to 20 KB per second (4 KB × 5 read capacity units).
  • Perform eventually consistent reads of up to 40 KB per second (twice as much read throughput).
  • Write up to 5 KB per second (1 KB × 5 write capacity units).

If your application reads or writes larger items (up to the DynamoDB maximum item size of 400 KB), it consumes more capacity units.
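
The capacity arithmetic can be expressed as a short helper. The following sketch simply illustrates the rounding rules described above (reads are billed in 4 KB increments and writes in 1 KB increments):

import math

def read_capacity_units(item_size_kb, strongly_consistent=True):
    # Reads are billed in 4 KB increments.
    units = math.ceil(item_size_kb / 4)
    # An eventually consistent read consumes half the capacity of a strongly consistent read.
    return units if strongly_consistent else units / 2

def write_capacity_units(item_size_kb):
    # Writes are billed in 1 KB increments.
    return math.ceil(item_size_kb)

# A 6 KB item: 2 RCUs (strongly consistent), 1 RCU (eventually consistent), 6 WCUs to write.
print(read_capacity_units(6), read_capacity_units(6, strongly_consistent=False), write_capacity_units(6))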

If your read or write requests exceed the throughput settings for a table, DynamoDB can throttle that request. DynamoDB can also throttle read requests that exceed the provisioned throughput for an index. Throttling prevents your application from consuming too many capacity units. When a request is throttled, it fails with an HTTP 400 code (Bad Request) and a ProvisionedThroughputExceededException. The AWS SDKs have built-in support for retrying throttled requests, so you do not need to write this logic yourself.

DynamoDB provides the following mechanisms for managing throughput as it changes:

Amazon DynamoDB Auto Scaling DynamoDB automatic scaling actively manages throughput capacity for tables and global secondary indexes. With automatic scaling, you define a range (upper and lower limits) for read and write capacity units. You also define a target utilization percentage within that range. DynamoDB auto scaling seeks to maintain your target utilization, even as your application workload increases or decreases.

Provisioned throughput If you aren’t using DynamoDB auto scaling, you have to define your throughput requirements manually. As discussed, with this setting you may run into a ProvisionedThroughputExceededException if you are throttled. But you can change your throughput with a few clicks.

Reserved capacity You can purchase reserved capacity in advance, where you pay a one-time upfront fee and commit to a minimum usage level over a period of time. You may realize significant cost savings compared to on-demand provisioned throughput settings.

On-demand It can be difficult to plan capacity, especially if you aren’t collecting metrics or perhaps are developing a new application and you aren’t sure what type of performance you require. With On-Demand mode, your DynamoDB table will automatically scale up or down to any previously reached traffic level. If a workload’s traffic level reaches a new peak, DynamoDB rapidly adapts to accommodate the workload. As a developer, focus on making improvements to your application and offload scaling activities to AWS.
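
As a rough sketch of how these settings are changed from code, the following Python SDK calls adjust a table's provisioned throughput and switch a table to on-demand mode; the table name and values are placeholders:

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# Manually change the provisioned throughput for a table
client.update_table(
    TableName='Music',
    ProvisionedThroughput={'ReadCapacityUnits': 10, 'WriteCapacityUnits': 10}
)

# Or switch the table to on-demand capacity mode
# (in practice, you would issue only one of these calls at a time)
client.update_table(
    TableName='Music',
    BillingMode='PAY_PER_REQUEST'
)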

Partitions and Data Distribution

When you are using a table in DynamoDB, the data is placed on multiple partitions (depending on the amount of data and the amount of throughput allocated to it; recall that throughput is determined by RCUs and WCUs). When you allocate RCUs and WCUs to a table, those RCUs and WCUs are split evenly among all partitions for your table.

For example, suppose that you have allocated 1,000 RCUs and 1,000 WCUs to a table, and this table has 10 partitions allocated to it. Then each partition would have 100 RCUs and 100 WCUs for it to use. If one of your partitions consumes all the RCUs and WCUs for the table, you may receive a ProvisionedThroughputExceededException error because one of your partitions is hot. To deal with hot partitions, DynamoDB has two features: burst capacity and adaptive capacity.

Burst Capacity

The previous example discussed how you had 10 partitions, each with 100 RCUs and 100 WCUs allocated to them. One of your partitions begins to become hot and now needs to consume more than 100 RCUs. Under normal circumstances, you may receive the ProvisionedThroughputExceededException error. However, with burst capacity, whenever your partition is not using all of its total capacity, DynamoDB reserves a portion of that unused capacity for later bursts of throughput to handle any spike your partition may experience.

At the time of this writing, DynamoDB reserves up to 300 seconds (5 minutes) of unused read and write capacity, which means that your partition can handle a peak load for up to 5 minutes above its normal expected load. Burst capacity is applied automatically and runs in the background.

Adaptive Capacity

Adaptive capacity helps when it is not possible to distribute read and write activity evenly across partitions. Continuing the example, suppose that a partition is experiencing not only peak demand but also sustained demand over and above its normal 100 RCUs and 100 WCUs, and that it now requires 200 RCUs instead of 100 RCUs.

DynamoDB adaptive capacity enables your application to continue reading and writing to hot partitions without being throttled, provided that the total provisioned capacity for the table is not exceeded. DynamoDB allocates additional RCUs to the hot partition; in this case, 100 more. With adaptive capacity, you will still be throttled for a period of time, typically between 5–30 minutes, before adaptive capacity turns on or activates. So, for a portion of time, your application will be throttled; however, after adaptive capacity allocates the RCUs to the partition, DynamoDB is able to sustain the new higher throughput for your partition and table. Adaptive capacity is on by default, and there is no need to enable or disable it.

Retrieving Data from DynamoDB

Two primary methods are used to retrieve data from DynamoDB: Query and Scan.

Query

In DynamoDB, you perform Query operations directly on the table or index. To run a Query, you must specify, at a minimum, the partition key value. If you are querying an index, you must specify both TableName and IndexName.

The following is a query on a Music table in DynamoDB using the Python SDK:

import boto3
import json
import decimal
from boto3.dynamodb.conditions import Key
 
# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)
 
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
 
table = dynamodb.Table('Music')
 
print("A query with DynamoDB")
 
# Query for all songs by the artist Sam Samuel (Artist is the partition key)
response = table.query(
    KeyConditionExpression=Key('Artist').eq('Sam Samuel')
)
 
for i in response['Items']:
    print(i['SongTitle'], "-", i['Genre'], i['Price'])

The query returns all of the songs by the artist Sam Samuel in the Music table.

Scan

You can also perform Scan operations on a table or index. The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index. To have DynamoDB return fewer items, you can provide a FilterExpression.

If the total number of scanned items exceeds the maximum dataset size limit of 1 MB, the scan stops, and the results are returned to the user along with a LastEvaluatedKey value that you can use to continue the scan in a subsequent operation. The results also include the number of items exceeding the limit. A scan can also return no table data if no items meet the filter criteria.

A single Scan operation reads up to the maximum number of items set (if you’re using the Limit parameter) or a maximum of 1 MB of data and then applies any filtering to the results by using FilterExpression. If LastEvaluatedKey is present in the response, you must paginate the result set.

Scan operations proceed sequentially; however, for faster performance on a large table or secondary index, applications can request a parallel Scan operation by providing the Segment and TotalSegments parameters.

Scan uses eventually consistent reads when accessing the data in a table; therefore, the result set might not include the changes to data in the table immediately before the operation began. If you need a consistent copy of the data, as of the time that the Scan begins, you can set the ConsistentRead parameter to true.

The following is a scan on the Music table with the Python SDK:

# Return all of the matching data in the table
import boto3
import json
import decimal
from boto3.dynamodb.conditions import Attr
 
# Create the DynamoDB resource
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
 
# Use the Music table
table = dynamodb.Table('Music')
 
# Helper class to convert a DynamoDB decimal/item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)
 
# Specify some filters for the scan
# Here we are stating that the Price must be between 12 and 30
fe = Attr('Price').between(12, 30)
pe = "#g, Price"
# Expression attribute names for the projection expression only
ean = { "#g": "Genre" }
 
# Perform the initial scan
response = table.scan(
    FilterExpression=fe,
    ProjectionExpression=pe,
    ExpressionAttributeNames=ean
)
 
# Print all the items from the first page of results
for i in response['Items']:
    print(json.dumps(i, cls=DecimalEncoder))
 
# Paginate through any remaining results
while 'LastEvaluatedKey' in response:
    response = table.scan(
        ProjectionExpression=pe,
        FilterExpression=fe,
        ExpressionAttributeNames=ean,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))

As you can see from the Python code, the scan returns the Genre and Price attributes of all items with a price between 12 and 30. The loop on the LastEvaluatedKey value continues the scan until the entire table has been read.

Global Tables

Global tables build upon the DynamoDB global footprint to provide you with a fully managed, multiregion, and multimaster database that provides fast, local, read-and-write performance for massively scaled, global applications. DynamoDB performs all the necessary tasks to create identical tables in these regions and propagate ongoing data changes to all of them. Figure 4.17 shows an example of how global tables can work with a global application and globally dispersed users.

Figure 4.17 Global tables

A global table is a collection of one or more DynamoDB tables, all owned by a single AWS account, identified as replica tables. A replica table (or replica, for short) is a single DynamoDB table that functions as a part of a global table. Each replica stores the same set of data items. Any given global table can have only one replica table per region, and every replica has the same table name and the same primary key schema. Changes made in one replica are recorded in a stream and propagated to other replicas, as shown in Figure 4.18.

Figure 4.18 Replication flow in global tables

If your application requires strongly consistent reads, then it must perform all of its strongly consistent reads and writes in the same region. DynamoDB does not support strongly consistent reads across AWS Regions.

Conflicts can arise if applications update the same item in different regions at about the same time (concurrent updates). To ensure eventual consistency, DynamoDB global tables use a “last writer wins” reconciliation between concurrent updates whereby DynamoDB makes a best effort to determine the last writer. With this conflict resolution mechanism, all replicas agree on the latest update and converge toward a state in which they all have identical data.

To create a DynamoDB global table, perform the following steps:

  1. Create an ordinary DynamoDB table, with DynamoDB Streams enabled, in an AWS Region.
  2. Repeat step 1 for every other AWS Region where you want to replicate your data.
  3. Define a DynamoDB global table based on the tables that you have created.

The AWS Management Console automates these tasks so that you can create a global table quickly and easily.
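
If you prefer to script these steps, the following is a minimal sketch using the Python SDK and the original (2017) version of the global tables API; it assumes that Music tables with streams enabled already exist in both AWS Regions:

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# Combine existing replica tables into a global table
client.create_global_table(
    GlobalTableName='Music',
    ReplicationGroup=[
        {'RegionName': 'us-east-1'},
        {'RegionName': 'eu-west-1'}
    ]
)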

Object Persistence Model

The DynamoDB object persistence model enables you to map client-side classes to DynamoDB tables. The instances of these classes (objects) map to items in a DynamoDB table.

You can use the object persistence programming interface to connect to DynamoDB; perform create, read, update, and delete operations (CRUD); execute queries; and implement optimistic locking with a version number. Figure 4.19 shows an example of using the object persistence model to map client-side objects in DynamoDB.

Figure 4.19 Object persistence model

Support for the object persistence model is available in the Java and .NET SDKs.

Amazon DynamoDB Local

DynamoDB Local is the downloadable version of DynamoDB that lets you write and test applications by using the Amazon DynamoDB API without accessing the DynamoDB web service. Instead, the database is self-contained on your computer. When you’re ready to deploy your application in production, you can make a few minor changes to the code so that it uses the DynamoDB web service.

Having this local version helps you save on provisioned throughput, data storage, and data transfer fees. In addition, you don’t need an internet connection while you’re developing your application.
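
Pointing the SDK at DynamoDB Local is a one-line change. The following sketch assumes DynamoDB Local is running on its default port (8000) on your computer:

import boto3

# Use the local endpoint instead of the DynamoDB web service
dynamodb = boto3.resource(
    'dynamodb',
    endpoint_url='http://localhost:8000',
    region_name='us-east-1'
)

# List the tables that exist in the local database
print(dynamodb.meta.client.list_tables()['TableNames'])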

IAM and Fine-Grained Access Control

You can use AWS IAM to grant or restrict access to DynamoDB resources and API actions. For example, you could allow a user to execute the GetItem operation on a Books table. DynamoDB also supports fine-grained access control so that you can control access to individual data items and attributes. Suppose, for example, that you have a Users table and you want each user to have access only to his or her own data. You can accomplish this with fine-grained access control by using a condition in an IAM policy with the dynamodb:LeadingKeys property.

By using LeadingKeys, you can limit a user so that they can access only the items whose partition key matches their user ID. In the following example, which applies to the UserProfiles table, access to profile information is restricted to the user to whom that information belongs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LeadingKeysExample",
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:BatchGetItem",
                "dynamodb:Query",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:accountnumber:table/UserProfiles"
            ],
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": [
                        "${www.amazon.com:user_id}"
              
                    ],
                    "dynamodb:Attributes": [
                        "UserId",
                        "FirstName",
                        "LastName",
                        "Email",
                        "Birthday"
                    ]
                },
                "StringEqualsIfExists": {
                    "dynamodb:Select": "SPECIFIC_ATTRIBUTES"
                }
            }
        }
    ]
}

As you can see in the IAM policy, the user is allowed to access only the subset of attributes listed in the dynamodb:Attributes section of the policy. Furthermore, the dynamodb:Select condition requires the application to provide a list of specific attributes to act upon, preventing the application from requesting all attributes.

Backup and Restore

You can create on-demand backups and enable point-in-time recovery for your DynamoDB tables.

On-demand backups let you create full backups of your tables, or restore from them, at any time. These actions execute with zero impact on table performance or availability and without consuming any provisioned throughput on the table.

Point-in-time recovery helps protect your DynamoDB tables from accidental write or delete operations. For example, suppose that a test script accidentally writes to a production DynamoDB table. With point-in-time recovery, you can restore that table to any point in time during the last 35 days. DynamoDB maintains incremental backups of your table. These operations will not affect performance or latency.
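
Both features can be managed from the console or the SDK. The following is a minimal sketch of the corresponding Python SDK calls; the table name and backup name are placeholders:

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# Take an on-demand backup of the table
client.create_backup(TableName='Music', BackupName='Music-backup-01')

# Enable point-in-time recovery (35-day rolling window)
client.update_continuous_backups(
    TableName='Music',
    PointInTimeRecoverySpecification={'PointInTimeRecoveryEnabled': True}
)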

Encryption with Amazon DynamoDB

DynamoDB offers fully managed encryption at rest, and it is enabled by default. DynamoDB uses AWS KMS for encrypting the objects at rest. By default, DynamoDB uses the AWS-owned customer master key (CMK); however, you can also specify your own AWS KMS CMK that you have created. For more information on AWS KMS, see Chapter 5, “Encryption on AWS.”
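
For example, the following sketch switches a table to a customer managed CMK by using the Python SDK; the key alias is a placeholder:

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# Use a customer managed AWS KMS CMK for encryption at rest
client.update_table(
    TableName='Music',
    SSESpecification={
        'Enabled': True,
        'SSEType': 'KMS',
        'KMSMasterKeyId': 'alias/my-dynamodb-key'  # placeholder key alias
    }
)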

Amazon DynamoDB Best Practices

Now that you understand what DynamoDB is and how you can use it to create a scalable database for your application, review some best practices for using DynamoDB.

Distribute Workload Evenly

The partition key portion of a table’s primary key determines the logical partitions in which the table’s data is stored. These logical partitions also affect the underlying physical partitions. As a result, you want to distribute your workload across the partitions as evenly as possible, reducing the number of “hot” partition issues that may arise.

Table 4.5 compares the more common partition key schemas and whether they are good for DynamoDB.

Table 4.5 Amazon DynamoDB Partition Key Recommended Strategies

Partition Key Value Uniformity
User ID, where the application has many users Good
Status code, where there are only a few possible status codes Bad
Item creation date, rounded to the nearest time period (for example, day, hour, or minute) Bad
Device ID, where each device accesses data at relatively similar intervals Good
Device ID, where even if there are many devices being tracked, one is by far more popular than all the others Bad

Comparison of Query and Scan Operations

The Query operation finds items in a table based on primary key values. You must provide the name of the partition key attribute and the value of that attribute. You can also provide a sort key attribute name and value to refine the search results (for example, all of the forums with this ID in the last seven days). By default, Query returns all of the data attributes for items with the specified primary keys. The results are sorted by the sort key in ascending order, which can be reversed. Queries are eventually consistent by default, with an option to request strongly consistent results if necessary.

The Scan operation returns all of the item attributes by accessing every item in the table. It is for this reason that Query is more efficient than the Scan operation.

Data Warehouse

If you are performing analytics, you may want to use a data warehouse. A data warehouse is a central repository of information that you can analyze to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through BI tools, SQL clients, and other analytics applications.

Data Warehouse Architecture

A data warehouse architecture consists of three tiers. The bottom tier of the architecture is the database server, where data is loaded and stored. The middle tier consists of the analytics engine that is used to access and analyze the data. The top tier is the front-end client that presents results through reporting, analysis, and data mining tools.

A data warehouse works by organizing data into a schema that describes the layout and type of data, such as integer, date, or string. When data is ingested, it is stored in various tables described by the schema. Query tools use the schema to determine which data tables to access and analyze.

Data Warehouse Benefits

Benefits of using a data warehouse include the following:

  • Better decision-making
  • Consolidation of data from many sources
  • Data quality, consistency, and accuracy
  • Historical intelligence
  • Analytics processing that is separate from transactional databases, improving the performance of both systems

The data warehousing landscape has changed dramatically in recent years with the emergence of cloud-based services that offer high performance, simple deployment, near-infinite scaling, and easy administration at a fraction of the cost of on-premises solutions.

Comparison of Data Warehouses and Databases

A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. A database is used to capture and store data, such as recording details of a transaction. Table 4.6 is useful in comparing the characteristics of data warehouses and databases.

Table 4.6 Comparison of Data Warehouse and Database Characteristics

Characteristics Data Warehouse Transactional Database
Suitable Workloads Analytics, reporting, big data Transaction processing
Data Source Data collected and normalized from many sources Data captured as-is from a single source, such as a transactional system
Data Capture Bulk write operations typically on a predetermined batch schedule Optimized for continuous write operations as new data is available to maximize transaction throughput
Data Normalization Denormalized schemas, such as the star schema or snowflake schema Highly normalized, static schemas
Data Storage Optimized for simplicity of access and high-speed query performance by using columnar storage Optimized for high-throughput write operations to a single row-oriented physical block
Data Access Optimized to minimize I/O and maximize data throughput High volumes of small read operations

Comparison of Data Warehouses and Data Lakes

Unlike a data warehouse, a data lake, as described in Chapter 3, “Hello, Storage,” is a centralized repository for all data, including structured and unstructured. A data warehouse uses a predefined schema that is optimized for analytics. In a data lake, the schema is not defined, enabling additional types of analytics, such as big data analytics, full text search, real-time analytics, and machine learning. Table 4.7 compares the characteristics of a data warehouse and a data lake.

Table 4.7 Comparison of Data Warehouse and Data Lake Characteristics

Characteristics Data Warehouse Data Lake
Data Relational data from transactional systems, operational databases, and line-of-business applications Nonrelational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications
Schema Designed before the data warehouse implementation (schema-on-write) Written at the time of analysis (schema-on-read)
Price/Performance Fastest query results by using higher-cost storage Query results getting faster by using low-cost storage
Data Quality Highly curated data that serves as the central version of the truth Any data that may or may not be curated (in other words, raw data)
Users Business analysts, data scientists, and data developers Data scientists, data developers, and business analysts (using curated data)
Analytics Batch reporting, BI, and visualizations Machine learning, predictive analytics, data discovery, and profiling

Comparison of Data Warehouses and Data Marts

A data mart is a data warehouse that serves the needs of a specific team or business unit, such as finance, marketing, or sales. It is smaller, is more focused, and may contain summaries of data that best serve its community of users. Table 4.8 compares the characteristics of a data warehouse and a data mart.

Table 4.8 Comparison of Data Warehouse and Data Mart Characteristics

Characteristics Data Warehouse Data Mart
Scope Centralized, multiple subject areas integrated together Decentralized, specific subject area
Users Organization-wide A single community or department
Data Source Many sources A single or a few sources, or a portion of data already collected in a data warehouse
Size Large—can be 100s of gigabytes to petabytes Small, generally up to 10s of gigabytes
Design Top-down Bottom-up
Data Detail Complete, detailed data May hold summarized data

Amazon Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data by using standard SQL and your existing BI tools. With Amazon Redshift, you can run complex analytic queries against petabytes of structured data using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. Most results come back in seconds. Amazon Redshift is up to 10 times faster than traditional on-premises data warehouses at 1/10 the cost.

Architecture

An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. After you provision your cluster, you can upload your dataset and then perform data analysis queries. Each cluster has a leader node and one or more compute nodes, and you have a choice of a hardware platform for your cluster.

Client Applications

Amazon Redshift integrates with various data loading and extract, transform, and load (ETL) tools and BI reporting, data mining, and analytics tools. It is based on open standard PostgreSQL, so most existing SQL client applications will integrate with Amazon Redshift with only minimal changes. For important differences between Amazon Redshift SQL and PostgreSQL, see the Amazon Redshift documentation.

Leader Node

The leader node acts as the SQL endpoint and receives queries from client applications, parses the queries, and develops query execution plans. The leader node then coordinates a parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. Finally, it returns the results to the client applications. The leader node also stores metadata about the cluster. Amazon Redshift communicates with client applications by using open standard PostgreSQL, JDBC, and ODBC drivers.

Compute Nodes

Compute nodes execute the query execution plan and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications.

Node Slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and allocates the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The node size of the cluster determines the number of slices per node.

Figure 4.20 shows the Amazon Redshift data warehouse architecture, including the client applications, JDBC and ODBC connections, leader node, compute nodes, and node slices.

Figure 4.20 Amazon Redshift architecture

Databases

A cluster contains one or more databases. User data is stored on the compute nodes.

Hardware Platform Options

When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. There are two categories for node types. The dense storage (DS) node types are storage-optimized using large magnetic disks and can provide up to 2 PB of storage capacity. The dense compute (DC) node types are compute-optimized. Because they use solid state drive (SSD) storage, they deliver much faster I/O compared to DS node types but provide less storage space at a maximum of 326 TB.

Table Design

Each database within an Amazon Redshift cluster can support many tables. Like most SQL-based databases, you can create a table using the CREATE TABLE command. This command specifies the name of the table, the columns, and their data types. This command also supports specifying compression encodings, distribution strategy, and sort keys in Amazon Redshift.
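
The distribution and sort key clauses are discussed in the sections that follow. As a simple illustration, the sketch below runs a CREATE TABLE statement from Python using the psycopg2 driver; the cluster endpoint, credentials, and table definition are placeholders:

import psycopg2

# Connection details are placeholders for your own cluster
conn = psycopg2.connect(
    host='examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='dev', user='awsuser', password='example-password'
)
cur = conn.cursor()

# Create a table with an explicit sort key, compression encoding, and distribution key
cur.execute("""
    CREATE TABLE sales (
        sale_id    INTEGER      NOT NULL,
        sale_date  DATE         NOT NULL SORTKEY,
        region     VARCHAR(20)  ENCODE lzo,
        amount     DECIMAL(10,2)
    )
    DISTSTYLE KEY
    DISTKEY (sale_id);
""")
conn.commit()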

Data Types

Each value that Amazon Redshift stores or retrieves has a data type with a fixed set of associated properties. Data types are declared when tables are created. Additional columns can be added to a table by using the ALTER TABLE command, but you cannot change the data type on an existing column.

Many familiar data types are available, including the following:

Numeric data types

  • BIGINT
  • DECIMAL
  • DOUBLE PRECISION
  • INTEGER
  • REAL
  • SMALLINT

Text data types

  • CHAR
  • VARCHAR

Date data types

  • DATE
  • TIMESTAMP
  • TIMESTAMPTZ

Logical data type

  • BOOLEAN

Compression Encoding

Amazon Redshift uses data compression as one of the key performance optimizations. When you load data for the first time into an empty table, Amazon Redshift samples your data automatically and selects the best compression scheme for each column. Alternatively, you can specify your preferred compression encoding on a per-column basis as part of the CREATE TABLE command.

Distribution Strategy

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table’s distribution style. When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.

This is one of the primary decisions when you’re creating a table in Amazon Redshift. You can configure the distribution style of a table to give Amazon Redshift hints as to how the data should be partitioned to meet your query patterns. The style you select for your database affects query performance, storage requirements, data loading, and maintenance. By choosing the best distribution strategy for each table, you can balance your data distribution and significantly improve overall system performance.

When creating a table, you can choose one of three distribution styles: EVEN, KEY, or ALL.

EVEN distribution Rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column. It is an appropriate choice when a table does not participate in joins or when there is not a clear choice between KEY distribution or ALL distribution. EVEN is the default distribution type.

KEY distribution Rows are distributed according to the values in one column. The leader node attempts to place matching values on the same node slice. Use this style when you will be querying heavily against values of a specific column.

ALL distribution A copy of the entire table is distributed to every node. This ensures that every row is collocated for every join in which the table participates. This multiplies the storage required by the number of nodes in the cluster, and it takes much longer to load, update, or insert data into multiple tables. Use this style only for relatively slow-moving tables that are not updated frequently or extensively.

Sort Keys

Another important decision to make during table creation is choosing the appropriate sort key. Amazon Redshift stores your data on disk in sorted order according to the sort key, and the query optimizer uses sort order when it determines the optimal query plans. Specify an appropriate sort key for the way that your data will be queried, filtered, or joined.

The following are some general guidelines for choosing the best sort key:

  • If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
  • If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
  • If you frequently join a table, specify the join column as both the sort key and the distribution key.

Loading Data

Loading large datasets can take a long time and consume many computing resources. How your data is loaded can also affect query performance. You can reduce these impacts by using COPY commands, bulk inserts, and staging tables when loading data into Amazon Redshift.

Symbol of Tip The COPY command loads data in parallel from Amazon S3 or other data sources in a more efficient manner than INSERT commands.
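
For example, a bulk load from Amazon S3 might look like the following sketch; the cluster endpoint, credentials, bucket path, and IAM role are placeholders:

import psycopg2

conn = psycopg2.connect(
    host='examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='dev', user='awsuser', password='example-password'
)
cur = conn.cursor()

# Bulk load CSV files from Amazon S3 in parallel
cur.execute("""
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV;
""")
conn.commit()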

Querying Data

You can query Amazon Redshift tables by using standard SQL commands, such as using SELECT statements, to query and join tables. For complex queries, you are able to analyze the query plan to choose better optimizations for your specific access patterns.

For large clusters supporting many users, you can configure workload management (WLM) to queue and prioritize queries.

Snapshots

Amazon Redshift supports snapshots, similar to Amazon RDS. You can create automated and manual snapshots, which are stored in Amazon S3 by using an encrypted Secure Socket Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.

When you restore from a snapshot, Amazon Redshift creates a new cluster and makes it available before all of the data is loaded so that you can begin querying the new cluster immediately. Amazon Redshift will stream data on demand from the snapshot in response to active queries and load all the remaining data in the background.

By contrast, achieving proper durability for a self-managed database requires more effort and attention. Even when using Amazon Elastic Block Store (Amazon EBS) volumes, you must take snapshots on a frozen file system to ensure consistency. Also, restoring a self-managed database might require additional operations beyond restoring a volume from a snapshot and attaching it to an Amazon EC2 instance.

Security

Securing your Amazon Redshift cluster is similar to securing other databases running in the AWS Cloud. To meet your needs, you will use a combination of IAM policies, security groups, and encryption to secure the cluster.

Encryption

Protecting the data stored in Amazon Redshift is an important aspect of your security design. Amazon Redshift supports encryption of data in transit using SSL-encrypted connections.

You can also enable database encryption for your clusters to help protect data at rest. AWS recommends enabling encryption for clusters that contain sensitive data. You might be required to use encryption depending on the compliance guidelines or regulations that govern your data. Encryption is an optional setting, and it must be configured during the cluster launch. To change encryption on a cluster, you need to create a new cluster and migrate the data.

Amazon Redshift automatically integrates with AWS KMS.

Symbol of Tip Implement security at every level of your Amazon Redshift architecture, including the infrastructure resources, database schema, data, and network access.

Access to Amazon Redshift resources is controlled at three levels: cluster management, cluster connectivity, and database access. For details on the controls available to help you manage each of these areas, see the Amazon Redshift Cluster Management Guide at https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html.

The following are some best practices for securing your Amazon Redshift deployments:

  • Enable and use SSL when connecting to the Amazon Redshift database port.
  • Ensure that your data is available only via SSL by setting the require_ssl parameter to true in the parameter group that is associated with the cluster.
  • Use long, random database passwords generated by Amazon Redshift and store them by using a secret management system.
  • Enable cluster encryption.
  • Secure the S3 bucket by enabling Amazon S3 encryption and configuring access control for Amazon S3.
  • Secure the ETL system by enacting access control, auditing/logging, patch management, disk encryption/secure deletion, and SSL connectivity to Amazon S3.
  • Secure the BI system by enacting access control, auditing, patching, SSL connectivity to Amazon Redshift, and SSL UI (if applicable).
  • Use cluster or VPC security groups to limit Amazon Redshift access only to the necessary IP addresses (for both inbound and outbound flows).

Amazon Redshift Spectrum

Amazon Redshift also includes Redshift Spectrum, allowing you to run SQL queries directly against exabytes of unstructured data in Amazon S3. No loading or transformation is required.

You can use many open data formats, including Apache Avro, CSV, Grok, Ion, JSON, Optimized Row Columnar (ORC), Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV. Redshift Spectrum automatically scales query compute capacity based on the data being retrieved, so queries against Amazon S3 run fast, regardless of dataset size.

To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that’s connected to your cluster so that you can execute SQL commands. The cluster and the data files in Amazon S3 must be in the same AWS Region.

In-Memory Data Stores

In-memory data stores are used for caching and real-time workloads. AWS provides a variety of in-memory, key-value database options. You can operate your own nonrelational key-value data store in the cloud on Amazon EC2 and Amazon EBS, work with AWS solution providers, or take advantage of fully managed nonrelational services such as Amazon ElastiCache.

Caching

In computing, the data in a cache is generally stored in fast-access hardware, such as random-access memory (RAM), and may also be used in correlation with a software component. A cache’s primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer.

Trading off capacity for speed, a cache typically stores a subset of data transiently, in contrast to databases whose data is usually complete and durable.

Benefits of Caching

A cache provides high-throughput, low-latency access to commonly accessed application data by storing the data in memory. Caching can improve the speed of your application. Caching reduces the response latency, which improves a user’s experience with your application.

Time-consuming database queries and complex queries often create bottlenecks in applications. In read-intensive applications, caching can provide large performance gains by reducing application processing time and database access time. Write-intensive applications typically do not see as great a benefit to caching. However, even write-intensive applications normally have a read/write ratio greater than 1, which implies that read caching can be beneficial. In summary, the benefits of caching include the following:

  • Improve application performance
  • Reduce database cost
  • Reduce load on the backend database tier
  • Facilitate predictable performance
  • Eliminate database hotspots
  • Increase read throughput (IOPS)

The following types of information or applications can often benefit from caching:

  • Results of database queries
  • Results of intensive calculations
  • Results of remote API calls
  • Compute-intensive workloads that manipulate large datasets, such as high-performance computing simulations and recommendation engines

Consider caching your data if the following conditions apply:

  • It is slow or expensive to acquire when compared to cache retrieval.
  • It is accessed with sufficient frequency.
  • Your data or information for your application is relatively static.
  • Your data or information for your application is rapidly changing, but some staleness is acceptable for your use case.

Caching Strategies

You can implement different caching strategies for your application. Two primary methods are lazy loading and write through. A cache hit occurs when the cache contains the information requested. A cache miss occurs when the cache does not contain the information requested.

Lazy loading Lazy loading is a caching strategy that loads data into the cache only when necessary. When your application requests data, it first makes the request to the cache. If the data exists in the cache (a cache hit), it is retrieved; but if it does not or has expired (a cache miss), then the data is retrieved from your data store and then stored in the cache. The advantage of lazy loading is that only the requested data is cached. The disadvantage is that there is a cache miss penalty resulting in three trips:

  1. The application requests data from the cache.
  2. If there is a cache miss, you must query the database.
  3. After data retrieval, the cache is updated.

Write through The write-through strategy adds data to, or updates data in, the cache whenever data is written to the database. The advantage of write through is that the data in the cache is never stale. The disadvantage is that there is a write penalty, because every write involves two trips: a write to the cache and a write to the database. Another disadvantage is that, because in many applications most data is never read, much of the data stored in the cache is never used. This storage incurs a cost for space and overhead due to the duplicate data. In addition, if your data is updated frequently, the cache may be updated often, causing cache churn.
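
The following sketch illustrates both strategies with a Redis client in Python; the cache endpoint is a placeholder, and query_database and write_database stand in for your own data access functions:

import json
import redis

# Placeholder ElastiCache for Redis endpoint
cache = redis.Redis(host='my-redis-endpoint.cache.amazonaws.com', port=6379)

# query_database and write_database are placeholders for your own data access code.

def get_user_lazy(user_id):
    # Lazy loading: check the cache first and fall back to the database on a miss
    cached = cache.get('user:%s' % user_id)
    if cached is not None:
        return json.loads(cached)                              # cache hit
    user = query_database(user_id)                             # cache miss: read from the database
    cache.setex('user:%s' % user_id, 300, json.dumps(user))    # populate the cache with a TTL
    return user

def save_user_write_through(user_id, user):
    # Write through: update the database and the cache on every write
    write_database(user_id, user)
    cache.set('user:%s' % user_id, json.dumps(user))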

In-Memory Key-Value Store

An in-memory key-value store is a NoSQL database optimized for read-heavy application workloads (such as social networking, gaming, media sharing, and Q&A portals) or compute-intensive workloads (such as a recommendation engine). In-memory caching improves application performance by storing critical pieces of data in memory for low-latency access. Cached information may include the results of I/O-intensive database queries or the results of computationally intensive calculations.

Benefits of In-Memory Data Stores

The strict performance requirements imposed by real-time applications mandate more efficient databases. Traditional databases rely on disk-based storage. A single user action may consist of multiple database calls. As they accumulate, latency increases. However, by accessing data in memory, in-memory data stores provide higher throughput and lower latency. In fact, in-memory data stores can be one to two orders of magnitude faster than disk-based databases.

As a NoSQL data store, an in-memory data store does not share the architectural limitations found in traditional relational databases. NoSQL data stores are built to be scalable. Traditional relational databases use a rigid table-based architecture. Some NoSQL data stores use a key-value store and therefore don’t enforce a structure on the data. This enables scalability and makes it easier to grow, partition, or shard data as data stores grow. When consumed as a cloud-based service, an in-memory data store also provides availability and cost benefits. On-demand access allows organizations to scale their applications as needed in response to demand spikes and at a lower cost than disk-based stores. Using managed cloud services also eliminates the need to administer infrastructure. Database hotspots are reduced, and performance becomes more predictable. Some cloud-based services also offer the benefit of high availability with replicas and support for multiple Availability Zones.

Benefits of Distributed Cache

A caching layer helps further drive throughput for read-heavy applications. A caching layer is a high-speed storage layer that stores a subset of data. When a read request is sent, the caching layer checks to determine whether it has the answer. If it doesn’t, the request is sent on to the database. Meeting read requests through the caching layer in this manner is more efficient and delivers higher performance than what can be had from a traditional database alone.

It is also more cost-effective. A single node of in-memory cache can deliver the same read throughput as several database nodes. Instead of provisioning additional instances of your traditional database to accommodate a demand spike, you can drive more throughput by adding one node of distributed cache, replacing several database nodes. The caching layer saves you money because you’re paying for one node instead of multiple database nodes, and you get the added benefit of dramatically faster performance for reads.

Amazon ElastiCache

Developers need a way to maintain super-low latency, even as they accommodate spikes in demand and while controlling infrastructure and database costs and load.

Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the AWS Cloud. The service improves the performance of web applications by allowing you to retrieve information from fast, managed, in-memory caches instead of relying entirely on slower disk-based databases.

ElastiCache automatically detects and replaces failed nodes, reducing the overhead associated with self-managed infrastructures and also provides a resilient system that mitigates the risk of overloaded cloud databases, which slow website and application load times.

ElastiCache currently supports two different open-source, in-memory, key-value caching engines: Redis and Memcached. Each engine provides some advantages.

Redis

Redis is an increasingly popular open-source, key-value store that supports more advanced data structures, such as sorted sets, hashes, and lists. Unlike Memcached, Redis has disk persistence built in, meaning that you can use it for long-lived data. Redis also supports replication, which can be used to achieve Multi-AZ redundancy, similar to Amazon RDS.

Memcached

Memcached is a widely adopted in-memory key store. It is historically the gold standard of web caching. ElastiCache is protocol-compliant with Memcached, and it is designed to work with popular tools that you use today with existing Memcached environments. Memcached is also multithreaded, meaning that it makes good use of larger Amazon EC2 instance sizes with multiple cores.

Comparison of Memcached and Redis

Although both Memcached and Redis appear similar on the surface, in that they are both in-memory key stores, they are quite different in practice. Because of the replication and persistence features of Redis, ElastiCache manages Redis more as a relational database. Redis ElastiCache clusters are managed as stateful entities that include failover, similar to how Amazon RDS manages database failover.

Conversely, because Memcached is designed as a pure caching solution with no persistence, ElastiCache manages Memcached nodes as a pool that can grow and shrink, similar to an Amazon EC2 Auto Scaling group. Individual nodes are expendable, and ElastiCache provides additional capabilities here, such as automatic node replacement and Auto Discovery.

Consider the following requirements when deciding between Memcached and Redis.

Use Memcached if you require one or more of the following:

  • Object caching is your primary goal, for example, to offload your database.
  • You are interested in as simple a caching model as possible.
  • You plan to run large cache nodes and require multithreaded performance with the use of multiple cores.
  • You want to scale your cache horizontally as you grow.

Use Redis if you require one or more of the following:

  • You are looking for more advanced data types, such as lists, hashes, and sets.
  • Sorting and ranking datasets in memory help you, such as with leaderboards.
  • Your application requires publish and subscribe (pub/sub) capabilities.
  • Persistence of your key store is important.
  • You want to run in multiple Availability Zones (Multi-AZ) with failover.
  • You want transactional support, which lets you execute a group of commands as an isolated and atomic operation.

Amazon DynamoDB Accelerator

Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to 10 times the performance improvement—from milliseconds to microseconds—even at millions of requests per second. DAX does all of the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.

With DAX, you can focus on building great applications for your customers without worrying about performance at scale. You do not need to modify application logic, because DAX is compatible with existing DynamoDB API calls. You can enable DAX with a few clicks in the AWS Management Console or by using the AWS SDK, and you pay only for the capacity you provision.

Graph Databases

AWS provides a variety of graph database options, such as Amazon Neptune, or you can operate your own graph database in the cloud on Amazon EC2 and Amazon EBS. This section takes a closer look at what exactly a graph database is and when you would want to use one.

Many applications being built today must understand and navigate relationships between highly connected data. This can enable use cases like the following:

  • Social applications
  • Recommendation engines
  • Fraud detection
  • Knowledge graphs
  • Life sciences
  • IT/network

Because the data is highly connected, it is easily represented as a graph, which is a data structure that consists of vertices and directed links called edges. Vertices and edges can each have properties associated with them. Figure 4.21 depicts a simple graph of relationships between friends and their interests—or social network—that could be stored and queried by using a graph database. A graph database is optimized to store and process graph data.

Figure 4.21 Example of a social network diagram

AWS provides a variety of graph database options. You can operate your own graph database in the cloud on Amazon EC2 and Amazon EBS. You can also use Neptune, a fully managed graph database service.

Amazon Neptune

Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with milliseconds latency.

Neptune is highly available and provides the following features:

  • Read replicas
  • Point-in-time recovery
  • Continuous backup to Amazon S3
  • Replication across Availability Zones
  • Encryption at rest and in transit

Figure 4.22 shows a knowledge graph that can be powered by Neptune.

Figure 4.22 Example of a graph database architecture running on Amazon Neptune

Neptune supports the popular graph models Property Graph and W3C’s RDF (Resource Description Framework), and their respective query languages, Apache TinkerPop Gremlin and SPARQL. With these models, you can easily build queries that efficiently navigate highly connected datasets (a short Gremlin sketch follows the list of use cases below). Neptune graph databases include the following use cases:

  • Recommendation engines
  • Fraud detection
  • Knowledge graphs
  • Drug discovery
  • Network security
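
The following is the minimal Gremlin sketch mentioned above, using the open-source gremlinpython client; the cluster endpoint and the person’s name are placeholder values:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder Neptune cluster endpoint
conn = DriverRemoteConnection(
    'wss://my-neptune-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Find the names of friends-of-friends for a given person in the social graph
names = (g.V().has('person', 'name', 'Justin')
          .out('friend').out('friend')
          .values('name').dedup().toList())
print(names)

conn.close()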

Cloud Database Migration

Data is the cornerstone of successful cloud application deployments. Your evaluation and planning process may highlight the physical limitations inherent to migrating data from on-premises locations into the cloud. Amazon offers a suite of tools to help you move data via networks, roads, and technology partners.

This chapter focuses on the AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT). Customers also use other AWS services and features that are discussed in Chapter 3, “Hello, Storage,” for cloud data migration, including the following:

  • AWS Direct Connect (DX)
  • AWS Snowball
  • AWS Snowball Edge
  • AWS Snowmobile
  • AWS Import/Export Disk
  • AWS Storage Gateway
  • Amazon Kinesis Data Firehose
  • Amazon S3 Transfer Acceleration
  • Virtual private network (VPN) connections

AWS Database Migration Service

AWS Database Migration Service (AWS DMS) helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. AWS DMS can migrate your data to and from the most widely used commercial and open-source databases.

The service supports homogenous database migrations, such as Oracle to Oracle, in addition to heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. You can also stream data to Amazon Redshift, Amazon DynamoDB, and Amazon S3 from any of the supported sources, such as Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, SAP ASE, SQL Server, IBM DB2 LUW, and MongoDB, enabling consolidation and easy analysis of data in a petabyte-scale data warehouse. You can also use AWS DMS for continuous data replication with high availability.

Figure 4.23 shows an example of both heterogeneous and homogenous database migrations.

The figure shows an example of both heterogeneous and homogenous database migrations.

Figure 4.23 Heterogeneous and homogenous database migrations using AWS DMS

To perform a database migration, AWS DMS connects to the source data store, reads the source data, and formats the data for consumption by the target data store. It then loads the data into the target data store. Most of this processing happens in memory, though large transactions might require some buffering to disk. Cached transactions and log files are also written to disk.

At a high level, when you’re using AWS DMS, complete the following tasks:

  • Create a replication server.
  • Create source and target endpoints that have connection information about your data stores.
  • Create one or more tasks to migrate data between the source and target data stores.

A task can consist of three major phases:

  • The full load of existing data
  • The application of cached changes
  • Ongoing replication
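
The following sketch shows what these steps might look like with Boto3 once a replication instance and the two endpoints already exist; every ARN shown is a placeholder for a resource you would create beforehand.

# A minimal sketch of creating an AWS DMS replication task with Boto3.
# It assumes a replication instance and source/target endpoints already exist;
# every ARN below is a placeholder, not a value from this chapter.
import boto3
import json

dms_client = boto3.client('dms')

# Migrate every table in the 'sales' schema
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include"
        }
    ]
}

response = dms_client.create_replication_task(
    ReplicationTaskIdentifier='sales-full-load-and-cdc',
    SourceEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE',
    TargetEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:TARGET',
    ReplicationInstanceArn='arn:aws:dms:us-east-1:123456789012:rep:INSTANCE',
    MigrationType='full-load-and-cdc',   # full load, cached changes, then ongoing replication
    TableMappings=json.dumps(table_mappings)
)
print('Task ARN: ' + response['ReplicationTask']['ReplicationTaskArn'])

# Once the task reaches the ready state, start it with start_replication_task,
# passing the task ARN and StartReplicationTaskType='start-replication'.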

AWS Schema Conversion Tool

For heterogeneous database migrations, AWS DMS uses the AWS Schema Conversion Tool (AWS SCT). AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database. Any objects that cannot be automatically converted are clearly marked so that they can be manually converted to complete the migration.

AWS SCT can also scan your application source code for embedded SQL statements and convert them as part of a database schema conversion project. During this process, AWS SCT performs cloud-native code optimization by converting legacy Oracle and SQL Server functions to their equivalent AWS service, thus helping you modernize the applications at the same time as database migration.

Figure 4.24 is a snapshot of the Action Items tab in the AWS SCT report, which shows the items that the tool could not convert automatically. These are the items that you would need to evaluate and adjust manually. The report helps you determine how much work is required to complete a conversion.

The figure shows a screenshot illustrating the Action Items tab in the AWS SCT report.

Figure 4.24 AWS SCT action items

After the schema conversion is complete, AWS SCT can help migrate data from a range of data warehouses to Amazon Redshift by using built-in data migration agents.

Your source database can be on-premises, in Amazon RDS, or in Amazon EC2, and the target database can be in either Amazon RDS or Amazon EC2. AWS SCT supports a number of different heterogeneous conversions. Table 4.9 lists the source and target databases that are supported at the time of this writing.

Table 4.9 Source and Target Databases Supported by AWS SCT

Source Database Target Database on Amazon RDS
Oracle Database Amazon Aurora, MySQL, PostgreSQL, Oracle
Oracle Data Warehouse Amazon Redshift
Azure SQL Amazon Aurora, MySQL, PostgreSQL
Microsoft SQL Server Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL
Teradata Amazon Redshift
IBM Netezza Amazon Redshift
IBM DB2 LUW Amazon Aurora, MySQL, PostgreSQL
HPE Vertica Amazon Redshift
MySQL and MariaDB PostgreSQL
PostgreSQL Amazon Aurora, MySQL
Amazon Aurora PostgreSQL
Greenplum Amazon Redshift
Apache Cassandra Amazon DynamoDB

Running Your Own Database on Amazon Elastic Compute Cloud

This chapter focused heavily on the AWS services that are available from a managed database perspective. However, it is important to know that you can also run your own unmanaged database on Amazon EC2, not only for the exam but for managing projects in the real world. For example, if you want to run MongoDB on Amazon EC2, this is perfectly within the realm of possibility. However, by doing so, you lose the many benefits of using a managed database service.

Compliance and Security

AWS includes various methods to provide security for your databases and meet the strictest of compliance standards. You can use the following:

  • Network isolation through virtual private cloud (VPC)
  • Security groups
  • IAM-based resource-level permission controls
  • Encryption at rest by using AWS KMS or Oracle/Microsoft Transparent Data Encryption (TDE); see the sketch after this list
  • Secure Sockets Layer (SSL) protection for data in transit
  • Assurance programs for finance, healthcare, government, and more
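
As one example from this list, the following sketch enables encryption at rest by adding the StorageEncrypted parameter (and optionally a KMS key) to the same create_db_instance call used in the exercises later in this chapter. The key ARN and credentials shown are placeholders.

# A minimal sketch of creating an Amazon RDS instance with encryption at rest.
# StorageEncrypted=True uses the default AWS KMS key for Amazon RDS; pass KmsKeyId
# to use your own key. The key ARN and credentials below are placeholders.
import boto3

rds_client = boto3.client('rds')

response = rds_client.create_db_instance(
    DBInstanceIdentifier='my-encrypted-rds-db',
    DBInstanceClass='db.t2.micro',
    Engine='mariadb',
    MasterUsername='masteruser',
    MasterUserPassword='replace-with-a-strong-password',
    AllocatedStorage=20,
    StorageEncrypted=True
    # KmsKeyId='arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID'
)
print(response['DBInstance']['DBInstanceStatus'])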

AWS Identity and Access Management

You can use AWS Identity and Access Management (IAM) to control who can perform actions on Amazon Aurora MySQL and Amazon RDS for MySQL. Here's an example policy that allows creating only MySQL DB instances of class db.t2.micro:

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Sid": "AllowCreateDBInstanceOnly",
             "Effect": "Allow",
             "Action": [
                 "rds:CreateDBInstance"
             ],
             "Resource": [
                 "arn:aws:rds:*:123456789012:db:test*",
                 "arn:aws:rds:*:123456789012:og:default*",
                 "arn:aws:rds:*:123456789012:pg:default*",
                 "arn:aws:rds:*:123456789012:subgrp:default"
             ],
             "Condition": {
                 "StringEquals": {
                     "rds:DatabaseEngine": "mysql",
                     "rds:DatabaseClass": "db.t2.micro"
                 }
             }
         }
     ]
 }
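
Beyond resource-level policies like the one above, IAM database authentication lets a client connect to Amazon RDS for MySQL or Amazon Aurora MySQL with a short-lived token instead of a password. The following sketch generates such a token with Boto3; the hostname and database user are placeholders, and the feature must first be enabled on the DB instance.

# A minimal sketch of IAM database authentication for MySQL-compatible engines.
# The endpoint and database user are placeholders; the DB instance must have IAM
# authentication enabled, and the caller needs rds-db:connect permission.
import boto3

rds_client = boto3.client('rds')

auth_token = rds_client.generate_db_auth_token(
    DBHostname='my-rds-db.xxxxxxxx.us-east-1.rds.amazonaws.com',
    Port=3306,
    DBUsername='iam_db_user',
    Region='us-east-1'
)

# Pass auth_token as the password (over SSL) to your MySQL client library.
print('Generated an authentication token valid for 15 minutes.')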
 

Summary

In this chapter, you learned the basic concepts of different types of databases, including relational, nonrelational, data warehouse, in-memory, and graph databases. From there, you learned about the various managed database services available on AWS. These included Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon ElastiCache, and Amazon Neptune. You also saw how you can run your own database on Amazon EC2. Finally, you looked at how to perform homogenous database migrations using the AWS Database Migration Service (AWS DMS). For heterogeneous database migrations, you learned that AWS DMS can use the AWS Schema Conversion Tool (AWS SCT).

Exam Essentials

Know what a relational database is. A relational database consists of one or more tables. Communication to and from relational databases usually involves simple SQL queries, such as “Add a new record” or “What is the cost of product x?” These simple queries are often referred to as online transaction processing (OLTP).

Know what a nonrelational database is. Nonrelational databases do not have a hard-defined data schema. They can use a variety of models for data management, such as in-memory key-value stores, graph data models, and document stores. These databases are optimized for applications that have a large data volume, require low latency, and have flexible data models. In nonrelational databases, there is no concept of foreign keys.

Understand the database options available on AWS. You can run all types of databases on AWS. You should understand that there are managed and unmanaged options available, in addition to relational, nonrelational, caching, graph, and data warehouses.

Understand which databases Amazon RDS supports. Amazon RDS currently supports six relational database engines:

  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • MariaDB
  • Amazon Aurora

Understand the operational benefits of using Amazon RDS. Amazon RDS is an AWS managed service. AWS is responsible for patching, antivirus, and the management of the underlying guest OS for Amazon RDS. Amazon RDS greatly simplifies the process of setting up a standby instance with replication for failover and setting up read replicas to offload queries.

Remember that you cannot access the underlying OS for Amazon RDS DB instances. You cannot use Remote Desktop Protocol (RDP) or SSH to connect to the underlying OS. If you need to access the OS, install custom software or agents, or use a database engine that Amazon RDS does not support, consider running your database on an Amazon EC2 instance instead.

Understand that Amazon RDS handles Multi-AZ failover for you. If your primary Amazon RDS instance becomes unavailable, AWS fails over to your secondary instance in another Availability Zone automatically. This failover is done by pointing your existing database endpoint to a new IP address. You do not have to change the connection string manually; AWS handles the DNS changes automatically.

Remember that Amazon RDS read replicas are used for scaling out and increased performance. This replication feature makes it easy to scale out your read-intensive databases. Read replicas are currently supported in Amazon RDS for MySQL, PostgreSQL, and Amazon Aurora. You can create one or more replicas of a database within a single AWS Region or across multiple AWS Regions. Amazon RDS uses native replication to propagate changes made to a source DB instance to any associated read replicas. Amazon RDS also supports cross-region read replicas to replicate changes asynchronously to another geography or AWS Region.
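
The following sketch creates a read replica with Boto3; the source identifier is a placeholder for an existing MySQL or PostgreSQL instance in your account (cross-Region replicas would reference the source instance's ARN instead).

# A minimal sketch of creating an Amazon RDS read replica with Boto3.
# The source DB instance identifier is a placeholder for an existing instance.
import boto3

rds_client = boto3.client('rds')

response = rds_client.create_db_instance_read_replica(
    DBInstanceIdentifier='my-source-db-replica',
    SourceDBInstanceIdentifier='my-source-db',
    DBInstanceClass='db.t2.micro'
)
print(response['DBInstance']['DBInstanceStatus'])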

Know how to calculate throughput for Amazon DynamoDB. Remember that one read capacity unit (RCU) represents one strongly consistent read per second or two eventually consistent reads per second for an item up to 4 KB in size. For writing data, know that one write capacity unit (WCU) represents one write per second for an item up to 1 KB in size. Be comfortable performing calculations to determine the appropriate setting for the RCU and WCU for a table.
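
A short worked sketch of that arithmetic follows; the traffic figures used here are arbitrary examples.

# A worked sketch of the capacity-unit arithmetic described above.
# RCUs: round the item size up to the next 4 KB, then multiply by reads per second
#       (eventually consistent reads need half as many units).
# WCUs: round the item size up to the next 1 KB, then multiply by writes per second.
import math

def rcus(reads_per_second, item_size_kb, strongly_consistent=True):
    units = reads_per_second * math.ceil(item_size_kb / 4)
    return units if strongly_consistent else math.ceil(units / 2)

def wcus(writes_per_second, item_size_kb):
    return writes_per_second * math.ceil(item_size_kb)

print(rcus(10, 6))          # 10 strongly consistent reads/second of 6 KB items -> 20 RCUs
print(rcus(10, 6, False))   # the same traffic with eventually consistent reads -> 10 RCUs
print(wcus(30, 2.5))        # 30 writes/second of 2.5 KB items -> 90 WCUs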

Know that DynamoDB spreads RCUs and WCUs across partitions evenly. Recall that when you allocate your total RCUs and WCUs to a table, DynamoDB spreads these across your partitions evenly. For example, if you have 1,000 RCUs and you have 10 partitions, then you have 100 RCUs allocated to each partition.

Know the differences between a local secondary index and a global secondary index. Remember that you can create local secondary indexes only when you initially create the table; additionally, know that local secondary indexes must share the same partition key as the parent or source table. Conversely, you can create global secondary indexes at any time, with different partition keys or sort keys.
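
The sketch below illustrates that difference: the local secondary index is declared when the table is created and reuses the table's partition key, while the global secondary index is added afterward with UpdateTable and has its own partition key. The table and attribute names are placeholders for this example.

# A minimal sketch contrasting a local secondary index (defined at table creation,
# same partition key) with a global secondary index (added later, its own key).
# Table and attribute names are placeholders.
import boto3

dynamodb_client = boto3.client('dynamodb')

# The LSI must be created together with the table and share its partition key.
dynamodb_client.create_table(
    TableName='Orders',
    KeySchema=[
        {'AttributeName': 'customer_id', 'KeyType': 'HASH'},
        {'AttributeName': 'order_id', 'KeyType': 'RANGE'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'customer_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date', 'AttributeType': 'S'}
    ],
    LocalSecondaryIndexes=[{
        'IndexName': 'OrdersByDate',
        'KeySchema': [
            {'AttributeName': 'customer_id', 'KeyType': 'HASH'},  # same partition key
            {'AttributeName': 'order_date', 'KeyType': 'RANGE'}
        ],
        'Projection': {'ProjectionType': 'ALL'}
    }],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)
dynamodb_client.get_waiter('table_exists').wait(TableName='Orders')

# A GSI can be added at any time and can use a completely different partition key.
dynamodb_client.update_table(
    TableName='Orders',
    AttributeDefinitions=[
        {'AttributeName': 'order_status', 'AttributeType': 'S'}
    ],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'OrdersByStatus',
            'KeySchema': [{'AttributeName': 'order_status', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'KEYS_ONLY'},
            'ProvisionedThroughput': {'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
        }
    }]
)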

Know the difference between eventually consistent and strongly consistent reads. Know that with eventually consistent reads, your application may retrieve data that is stale; but with strongly consistent reads, the data is always up-to-date.
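
In code, this is a single parameter: reads are eventually consistent by default, and you opt into a strongly consistent read per request, as in this sketch against the Users table that the exercises create later in this chapter.

# A minimal sketch of choosing read consistency per request on the Users table
# created in Exercise 4.7 (keys match the item written in Exercise 4.8).
import boto3

dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')

# Default: eventually consistent (may briefly return stale data)
eventual = table.get_item(
    Key={'user_id': '1234-5678', 'user_email': '[email protected]'}
)

# Strongly consistent: reflects all successful prior writes
strong = table.get_item(
    Key={'user_id': '1234-5678', 'user_email': '[email protected]'},
    ConsistentRead=True
)
print(strong.get('Item'))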

Understand the purpose of caching data and which related services are available. Know why caching is important for your database tier and how it helps to improve your application performance. Additionally, understand the differences between the caching methods (lazy loading and write-through) and the corresponding AWS services (Amazon DynamoDB Accelerator (DAX), ElastiCache for Redis, and ElastiCache for Memcached).
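
The two patterns can be sketched in a few lines: lazy loading checks the cache first and populates it only on a miss, while write-through updates the cache whenever the database is written. The redis calls below are standard redis-py operations against a placeholder ElastiCache endpoint; the two database helper functions are hypothetical stand-ins for your data layer.

# Sketches of lazy loading (cache-aside) and write-through caching with redis-py.
# The ElastiCache endpoint is a placeholder, and the two database helpers are
# hypothetical stand-ins for a real data layer.
import json
import redis

cache = redis.Redis(host='my-elasticache-endpoint', port=6379)
TTL_SECONDS = 300

def load_from_database(key):
    # Placeholder for a real database query
    return {'id': key, 'value': 'example'}

def save_to_database(key, value):
    # Placeholder for a real database write
    pass

def get_with_lazy_loading(key):
    cached = cache.get(key)
    if cached is not None:                       # cache hit
        return json.loads(cached)
    value = load_from_database(key)              # cache miss: read the database
    cache.setex(key, TTL_SECONDS, json.dumps(value))
    return value

def put_with_write_through(key, value):
    save_to_database(key, value)                          # write the database...
    cache.setex(key, TTL_SECONDS, json.dumps(value))      # ...and the cache together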

Resources to Review

Exercises

In the following exercises, you will launch two types of databases: the first is a SQL database on Amazon RDS, and the second is Amazon DynamoDB (NoSQL). For this set of exercises, you will use the Python 3 SDK (Boto3). You can download the Python 3 SDK at https://aws.amazon.com/sdk-for-python/.

Exercise 4.1

Create a Security Group for the Database Tier on Amazon RDS

Before you can create your first Amazon RDS database, you must create a security group so that you can allow traffic from your development server to communicate with the database tier. To do this, you must use an Amazon EC2 client to create the security group. Security groups are a component of the Amazon EC2 service, even though you can use them as part of Amazon RDS to secure your database tier.

To create the security group, run the following script:

# Exercise 4.1
import boto3
import json
import datetime

# Let's create some variables we'll use throughout these exercises in Chapter 4.
# NOTE: Here we are using a CIDR range for incoming traffic. We have set it to
# 0.0.0.0/0, which means ANYONE on the internet can access your database if they
# have the username and the password.
# If possible, specify your own CIDR range. You can figure out your IP address
# by visiting the following link:
# https://www.google.com/search?q=what+is+my+ip
# In the variable, don't forget to add /32!
# If you aren't sure, leave it open to the world.

# Variables
sg_name = 'rds-sg-dev-demo'
sg_description = 'RDS Security Group for AWS Dev Study Guide'
my_ip_cidr = '0.0.0.0/0'

# Create the EC2 Client to create the Security Group for your Database
ec2_client = boto3.client('ec2')

# First we need to create a security group
response = ec2_client.create_security_group(
     Description=sg_description,
     GroupName=sg_name)
print(json.dumps(response, indent=2, sort_keys=True))
# Now add a rule for the security group
response = ec2_client.authorize_security_group_ingress(
     CidrIp=my_ip_cidr,
     FromPort=3306,
     GroupName=sg_name,
     ToPort=3306,
     IpProtocol='tcp'
     )
print("Security Group should be created! Verify this in the AWS Console.")

After running the Python code, verify that the security group was created successfully from the AWS Management Console. You can find this confirmation under the VPC or Amazon EC2 service.

Exercise 4.2

Spin Up the MariaDB Database Instance

Use the Python SDK to spin up your MariaDB database hosted on Amazon RDS.

To spin up the MariaDB database, run the following script and update the Variables section to meet your needs:

# Exercise 4.2
import boto3
import json
import datetime

# Just a quick helper function for date/time conversions, in case you want to
# print the raw JSON
def date_time_converter(o):
     if isinstance(o, datetime.datetime):
         return o.__str__()

# Variables
sg_name = 'rds-sg-dev-demo'
rds_identifier = 'my-rds-db'
db_name = 'mytestdb'
user_name = 'masteruser'
user_password = 'mymasterpassw0rd1!'
admin_email = '[email protected]'
sg_id_number = ''
rds_endpoint = ''
  
# We need to get the Security Group ID Number to use in the creation of the
# RDS instance
ec2_client = boto3.client('ec2')
response = ec2_client.describe_security_groups(
     GroupNames=[
         sg_name
     ])

sg_id_number = json.dumps(response['SecurityGroups'][0]['GroupId'])
sg_id_number = sg_id_number.replace('"','')

#  Create the client for Amazon RDS
rds_client = boto3.client('rds')
  
# This will create our MariaDB database
# NOTE: Here we are hardcoding passwords for simplicity and testing purposes only!
# In production you should never hardcode passwords in configuration files/code!
# NOTE: This will create a MariaDB database. Be sure to remove it when you are done.
response = rds_client.create_db_instance(
     DBInstanceIdentifier=rds_identifier,
     DBName=db_name,
     DBInstanceClass='db.t2.micro',
     Engine='mariadb',
     MasterUsername=user_name,
     MasterUserPassword=user_password,
     VpcSecurityGroupIds=[
         sg_id_number
     ],
     AllocatedStorage=20,
     Tags=[
         {
             'Key': 'POC-Email',
             'Value': admin_email
         },
         {
             'Key': 'Purpose',
             'Value': 'AWS Developer Study Guide Demo'
         }
     ]
 )
  
# We need to wait until the DB Cluster is up!
print('Creating the RDS instance. This may take several minutes...')
waiter = rds_client.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier=rds_identifier)
  
print('Okay! The Amazon RDS Database is up!')

After the script has executed, the following message is displayed:

Creating the RDS instance. This may take several minutes.

After the Amazon RDS database instance has been created successfully, the following confirmation is displayed:

Okay! The Amazon RDS Database is up!

You can also view these messages from the Amazon RDS console.

Exercise 4.3

Obtain the Endpoint Value for the Amazon RDS Instance

Before you can start using the Amazon RDS instance, you must first specify your endpoint. In this exercise, you will use the Python SDK to obtain the value.

To obtain the Amazon RDS endpoint, run the following script:

# Exercise 4.3
import boto3
import json
import datetime
 
 
# Just a quick helper function for date/time conversions, in case you want to
# print the raw JSON
def date_time_converter(o):
    if isinstance(o, datetime.datetime):
        return o.__str__()
 
# Variables
rds_identifier = 'my-rds-db'
 
#  Create the client for Amazon RDS
rds_client = boto3.client('rds')
 
print("Fetching the RDS endpoint...")
response = rds_client.describe_db_instances(
    DBInstanceIdentifier=rds_identifier
)
 
rds_endpoint = json.dumps(response['DBInstances'][0]['Endpoint']['Address'])
rds_endpoint = rds_endpoint.replace('"','')
print('RDS Endpoint: ' + rds_endpoint)

After running the Python code, the following status is displayed:

Fetching the RDS endpoint...
RDS Endpoint: <endpoint_name>

If the endpoint is not returned, from the AWS Management Console, under the RDS service, verify that your Amazon RDS database instance was created.

Exercise 4.4

Create a SQL Table and Add Records to It

You now have all the necessary information to create your first SQL table by using Amazon RDS. In this exercise, you will create a SQL table and add a couple of records. Remember to update the variables for your specific environment.

To update the variables, run the following script:

# Exercise 4.4
import boto3
import json
import datetime
import pymysql as mariadb
  
# Variables
rds_identifier = 'my-rds-db'
db_name = 'mytestdb'
user_name = 'masteruser'
user_password = 'mymasterpassw0rd1!'
rds_endpoint = 'my-rds-db.****.us-east-1.rds.amazonaws.com'
  
# Step 1 - Connect to the database to create the table
db_connection = mariadb.connect(host=rds_endpoint, user=user_name,
     password=user_password, database=db_name)
cursor = db_connection.cursor()
try:
     cursor.execute("CREATE TABLE Users (user_id INT NOT NULL AUTO_INCREMENT,
user_fname VARCHAR(100) NOT NULL, user_lname VARCHAR(150) NOT NULL, user_
email VARCHAR(175) NOT NULL, PRIMARY KEY (`user_id`))")
     print('Table Created!')
except mariadb.Error as e:
     print('Error: {}'.format(e))
finally:
     db_connection.close()
  
# Step 2 - Connect to the database to add users to the table
db_connection = mariadb.connect(host=rds_endpoint, user=user_name,
  password=user_password, database=db_name)
cursor = db_connection.cursor()
try:
     sql = "INSERT INTO `Users` (`user_fname`, `user_lname`, `user_email`) VALUES (%s, %s, %s)"
     cursor.execute(sql, ('CJ', 'Smith', '[email protected]'))
     cursor.execute(sql, ('Casey', 'Smith', '[email protected]'))
     cursor.execute(sql, ('No', 'One', '[email protected]'))
# No data is saved unless we commit the transaction!
     db_connection.commit()
     print('Inserted Data to Database!')
except mariadb.Error as e:
     print('Error: {}'.format(e))
     print('Sorry, something has gone wrong!')
finally:
     db_connection.close()

 

After running the Python code, the following confirmation is displayed:

Table Created!
Inserted Data to Database!

Your Amazon RDS database now has some data stored in it.

Note: In this exercise, you are hardcoding a password into your application code for demonstration purposes only. In a production environment, refrain from hardcoding application passwords. Instead, use services such as AWS Secrets Manager to keep your secrets secure.
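
A sketch of that approach, assuming a secret named rds/mytestdb that stores the user name and password as JSON keys, might look like the following.

# A minimal sketch of reading database credentials from AWS Secrets Manager
# instead of hardcoding them. The secret name 'rds/mytestdb' and its JSON keys
# ('username' and 'password') are assumptions for illustration.
import json
import boto3

secrets_client = boto3.client('secretsmanager')

secret_value = secrets_client.get_secret_value(SecretId='rds/mytestdb')
credentials = json.loads(secret_value['SecretString'])

user_name = credentials['username']
user_password = credentials['password']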

Exercise 4.5

Query the Items in the SQL Table

After adding data to your SQL database, in this exercise you will be able to read or query the items in the Users table.

To read the items in the SQL table, run the following script:

# Exercise 4.5
import boto3
import json
import datetime
import pymysql as mariadb
 
# Variables
rds_identifier = 'my-rds-db'
db_name = 'mytestdb'
user_name = 'masteruser'
user_password = 'mymasterpassw0rd1!'
rds_endpoint = 'my-rds-db.*****.us-east-1.rds.amazonaws.com'
 
db_connection = mariadb.connect(host=rds_endpoint, user=user_name, password=user_password, database=db_name)
cursor = db_connection.cursor()
try:
    sql = "SELECT * FROM `Users`"
    cursor.execute(sql)
    query_result = cursor.fetchall()
    print('Querying the Users Table...')
    print(query_result)
except mariadb.Error as e:
    print('Error: {}'.format(e))
    print('Sorry, something has gone wrong!')
finally:
    db_connection.close()
 

After running the Python code, you will see the three records that you inserted in the previous exercise.

Exercise 4.6

Remove Amazon RDS Database and Security Group

You’ve created an Amazon RDS DB instance and added data to it. In this exercise, you will remove a few resources from your account. Remove the Amazon RDS instance first.

To remove the Amazon RDS instance and the security group, run the following script:

# Exercise 4.6
import boto3
import json
import datetime
 
# Variables
rds_identifier = 'my-rds-db'
sg_name = 'rds-sg-dev-demo'
sg_id_number = ''
 
# Create the client for Amazon RDS
rds_client = boto3.client('rds')
 
# Delete the RDS Instance
response = rds_client.delete_db_instance(
    DBInstanceIdentifier=rds_identifier,
    SkipFinalSnapshot=True)
 
print('RDS Instance is being terminated...This may take several minutes.')
 
waiter = rds_client.get_waiter('db_instance_deleted')
waiter.wait(DBInstanceIdentifier=rds_identifier)
 
# We must wait to remove the security groups until the RDS database has been
# deleted; this is a dependency.
print('The Amazon RDS database has been deleted. Removing Security Groups')
 
 
# Create the client for Amazon EC2 SG
ec2_client = boto3.client('ec2')
 
# Get the Security Group ID Number
response = ec2_client.describe_security_groups(
    GroupNames=[
        sg_name
    ])
sg_id_number = json.dumps(response['SecurityGroups'][0]['GroupId'])
sg_id_number = sg_id_number.replace('"','')
 
# Delete the Security Group!
response = ec2_client.delete_security_group(
    GroupId=sg_id_number
    )
 
print('Cleanup is complete!')

After running the Python code, the following message is displayed:

Cleanup is complete!

The Amazon RDS database and the security group are removed. You can verify this from the AWS Management Console.

Exercise 4.7

Create an Amazon DynamoDB Table

Amazon DynamoDB is a managed NoSQL database. One major difference between DynamoDB and Amazon RDS is that DynamoDB doesn’t require a server that is running in your VPC, and you don’t need to specify an instance type. Instead, create a table.

To create the table, run the following script:

# Exercise 4.7
import boto3
import json
import datetime
 
dynamodb_resource = boto3.resource('dynamodb')
 
table = dynamodb_resource.create_table(
    TableName='Users',
    KeySchema=[
        {
            'AttributeName': 'user_id',
            'KeyType': 'HASH'
        },
        {
            'AttributeName': 'user_email',
            'KeyType': 'RANGE'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'user_id',
            'AttributeType': 'S'
        },
        {
            'AttributeName': 'user_email',
            'AttributeType': 'S'
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
 
print("The DynamoDB Table is being created, this may take a few minutes...")
table.meta.client.get_waiter('table_exists').wait(TableName='Users')
print("Table is ready!")

After running the Python code, the following message is displayed:

Table is ready!

From the AWS Management Console, under DynamoDB, verify that the table was created.

Exercise 4.8

Add Users to the Amazon DynamoDB Table

With DynamoDB, there are fewer components to set up than there are for Amazon RDS. In this exercise, you’ll add users to your table. Experiment with updating and changing some of the code to add multiple items to the database.

To add users to the DynamoDB table, run the following script:

# Exercise 4.8
import boto3
import json
import datetime
# In this example we are not using uuid; however, you could use this to autogenerate your user IDs.
# i.e. str(uuid.uuid4())
import uuid
 
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
 
# Write a record to DynamoDB
response = table.put_item(
    Item={
        'user_id': '1234-5678',
        'user_email': '[email protected]',
        'user_fname': 'Sam',
        'user_lname': 'Samuels'
    }
)
 
# Just printing the raw JSON response, you should see a 200 status code
print(json.dumps(response, indent=2, sort_keys=True))
 

After running the Python code, you receive a 200 HTTP Status Code from AWS. This means that the user record has been added.

From the AWS Management Console, under DynamoDB, review the table to verify that the user record was added.

Exercise 4.9

Look Up a User in the Amazon DynamoDB Table

In this exercise, you look up the one user you’ve added so far.

To look up users in the DynamoDB table, run the following script:

# Exercise 4.9
import boto3
from boto3.dynamodb.conditions import Key
import json
import datetime
 
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
 
# Query some data
response = table.query(
    KeyConditionExpression=Key('user_id').eq('1234-5678')
)
 
# Print the data out!
print(json.dumps(response['Items'], indent=2, sort_keys=True))
 

After running the Python code, the query results are returned in JSON format showing a single user.

Exercise 4.10

Write Data to the Table as a Batch Process

In this exercise, you will write data to the table through a batch process.

To write data using a batch process, run the following script:

# Exercise 4.10
import boto3
import json
import datetime
import uuid
  
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
  
# Generate some random data
with table.batch_writer() as user_data:
     for i in range(100):
         user_data.put_item(
             Item={
                 'user_id': str(uuid.uuid4()),
                 'user_email': 'someone' + str(i) + '@somewhere.com',
                 'user_fname': 'User' + str(i),
                 'user_lname': 'UserLast' + str(i)
             }
         )
         print('Writing record # ' + str(i+1) + ' to DynamoDB Users Table')
     print('Done!')
 

After running the Python code, the last few lines read as follows:

Writing record # 100 to DynamoDB Users Table
Done!

From the AWS Management Console, under DynamoDB Table, verify that the users were written to the table.

Exercise 4.11

Scan the Amazon DynamoDB Table

In this exercise, you will scan the entire table.

To scan the table, run the following script:

# Exercise 4.11
import boto3
import json
import datetime
import uuid
 
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
 
# Let's do a scan!
response = table.scan()
 
print('The total Count is: ' + json.dumps(response['Count']))
print(json.dumps(response['Items'], indent=2, sort_keys=True))
 

As you learned in this chapter, scans return the entire dataset located in the table. After running the script, all of the users are returned.

Exercise 4.12

Remove the Amazon DynamoDB Table

In this exercise, you will remove the DynamoDB table that you created in Exercise 4.7.

To remove the table, run the following script:

# Exercise 4.12
import boto3
import json
import datetime
import uuid
 
# Create a DynamoDB Resource
dynamodb_client = boto3.client('dynamodb')
 
# Delete the Table
response = dynamodb_client.delete_table(TableName='Users')
print(json.dumps(response, indent=2, sort_keys=True))
 

The DynamoDB table is deleted, or it is in the process of being deleted. Verify the deletion from the AWS Management Console, under the DynamoDB service.

 

Review Questions

  1. Which of the following does Amazon Relational Database Service (Amazon RDS) manage on your behalf? (Select THREE.)

    1. Database settings
    2. Database software installation and patching
    3. Query optimization
    4. Hardware provisioning
    5. Backups
  2. Which AWS database service is best suited for managing highly connected datasets?

    1. Amazon Aurora
    2. Amazon Neptune
    3. Amazon DynamoDB
    4. Amazon Redshift
  3. You are designing an ecommerce web application that will scale to potentially hundreds of thousands of concurrent users. Which database technology is best suited to hold the session state for large numbers of concurrent users?

    1. Relational database by using Amazon Relational Database Service (Amazon RDS)
    2. NoSQL database table by using Amazon DynamoDB
    3. Data warehouse by using Amazon Redshift
    4. MySQL on Amazon EC2
  4. How many read capacity units (RCUs) do you need to support 25 strongly consistent reads per second of 15 KB?

    1. 100 RCUs
    2. 25 RCUs
    3. 10 RCUs
    4. 15 RCUs
  5. How many read capacity units (RCUs) do you need to support 25 eventually consistent reads per second of 15 KB?

    1. 10 RCUs
    2. 25 RCUs
    3. 50 RCUs
    4. 15 RCUs
  6. How many write capacity units (WCUs) are needed to support 100 writes per second of 512 bytes?

    1. 129 WCUs
    2. 25 WCUs
    3. 10 WCUs
    4. 100 WCUs
  7. Your company is using Amazon DynamoDB, and they would like to implement a write-through caching mechanism. They would like to get everything up and running in only a few short weeks. Additionally, your company would like to refrain from managing any additional servers. You are the lead developer on the project; what should you recommend?

    1. Build your own custom caching application.
    2. Implement Amazon DynamoDB Accelerator (DAX).
    3. Run Redis on Amazon EC2.
    4. Run Memcached on Amazon EC2.
  8. Your company would like to implement a highly available caching solution for its SQL database running on Amazon RDS. Currently, all of its services are running in the AWS Cloud. As their lead developer, what should you recommend?

    1. Implement your own caching solution on-premises.
    2. Implement Amazon ElastiCache for Redis.
    3. Implement Amazon ElastiCache for Memcached.
    4. Implement Amazon DynamoDB Accelerator (DAX).
  9. A company is looking to run analytical queries and would like to implement a data warehouse. It estimates that it has roughly 300 TB worth of data, which is expected to double in the next three years. Which AWS service should you recommend?

    1. Relational database by using Amazon Relational Database Service (Amazon RDS)
    2. NoSQL database table by using Amazon DynamoDB
    3. Data warehouse by using Amazon Redshift
    4. Amazon ElastiCache for Redis
  10. A company is experiencing an issue with Amazon DynamoDB whereby the data is taking longer than expected to return from a query. You are tasked with investigating the problem. After looking at the application code, you realize that a Scan operation is being called for a large DynamoDB table. What should you do or recommend?

    1. Implement a query instead of a scan, if possible, as queries are more efficient than a scan.
    2. Do nothing; the problem should go away on its own.
    3. Implement a strongly consistent read.
    4. Increase the write capacity units (WCUs).