THE AWS CERTIFIED DEVELOPER – ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:
In addition to the storage options discussed in the previous chapters, AWS offers a broad range of databases purpose-built for your specific application use cases. You can also set up your own database platform on the Amazon Elastic Compute Cloud (Amazon EC2). You can migrate your existing databases easily and cost-effectively with the AWS Database Migration Service (AWS DMS).
AWS Cloud offerings include the following databases:
Depending on your use case, you can choose from an AWS database that closely aligns with your needs. Table 4.1 describes each database service that AWS offers and indicates the database type.
Table 4.1 AWS Database Service Mapping to Database Type
Product | Type | Description |
Amazon Aurora | Relational database | A MySQL- and PostgreSQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. |
Amazon Relational Database Service (Amazon RDS) | Relational database | A managed relational database for MySQL, PostgreSQL, Oracle, SQL Server, and MariaDB. Easy to set up, operate, and scale a relational database in the cloud quickly. |
Amazon DynamoDB | NoSQL database | A serverless, managed NoSQL database that delivers consistent single-digit millisecond latency at any scale. Pay only for the throughput and storage you use. |
Amazon Redshift | Data warehouse | A fast, fully managed, petabyte-scale data warehouse at one-tenth the cost of traditional solutions. Simple and cost-effective solution to analyze data by using standard SQL and your existing business intelligence (BI) tools. |
Amazon ElastiCache | In-memory data store | A service to deploy, operate, and scale an in-memory data store based on Memcached or Redis in the cloud. |
Amazon Neptune | Graph database | A fast, reliable, fully managed graph database to store and manage highly connected datasets. |
Amazon DocumentDB (with MongoDB compatibility) | Nonrelational database | A fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. |
Amazon Timestream | Time series database | A fast, scalable, fully managed time series database service for IoT and operational applications that makes it easy to store and analyze trillions of events per day at one-tenth the cost of relational databases. |
Amazon Quantum Ledger Database (Amazon QLDB) | Ledger database | A fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority. |
AWS Database Migration Service (AWS DMS) | Database migration | Helps you migrate your databases to AWS easily and inexpensively with minimal downtime. |
Now that you know the purpose of these database services and what they can do, review the types of applications that each AWS database service is suited for. Refer to the application type mappings shown in Table 4.2.
Table 4.2 Application Mapping to AWS Database Service
Applications | Product |
Transactional applications, such as ERP, CRM, and ecommerce to log transactions and store structured data. | Aurora or Amazon RDS |
Internet-scale applications, such as hospitality, dating, and ride sharing, to serve content and store structured and unstructured data. | DynamoDB or Amazon DocumentDB |
Analytic applications for operational reporting and querying terabyte- to exabyte-scale data. | Amazon Redshift |
Real-time application use cases that require submillisecond latency such as gaming leaderboards, chat, messaging, streaming, and Internet of Things (IoT). | ElastiCache |
Applications with use cases that require navigation of highly connected data such as social news feeds, recommendations, and fraud detection. | Neptune |
Applications that collect data at millions of inserts per second in a time-series fashion, for example clickstream data and IoT devices. | Timestream |
Applications that require an accurate history of their application data; for example, tracking the history of credits and debits in banking transactions or verifying the audit trails created in relational databases. | Amazon QLDB |
Many developers have had to interact with relational databases in their applications. This section describes first what a relational database is. Then, it covers how you can run a relational database in the AWS Cloud with Amazon RDS or on Amazon EC2.
A relational database is a collection of data items with predefined relationships between them. These items are organized as a set of tables with columns and rows. Tables store information about the objects to be represented in the database. Each column in a table holds certain data, and a field stores the actual value of an attribute. The rows in the table represent a collection of related values of one object or entity. Each row in a table contains a unique identifier called a primary key, and rows among multiple tables can be linked by using foreign keys. You can access data in many different ways without reorganizing the database tables.
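These relationships can be sketched with SQLite, which ships with the Python standard library; the table and column names here are invented for illustration:

```python
import sqlite3

# In-memory database for illustration; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default

# Each row in customers is identified by a primary key.
conn.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
)""")

# orders links back to customers through a foreign key.
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL NOT NULL
)""")

conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (100, 1, 19.99)")

# Data in multiple tables can be accessed together without reorganizing them.
row = conn.execute("""SELECT c.name, o.total
                      FROM customers c JOIN orders o
                        ON c.customer_id = o.customer_id""").fetchone()
print(row)  # ('Alice', 19.99)
```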
Relational databases share four important characteristics: Structured Query Language (SQL), data integrity, transactions, and ACID (atomic, consistent, isolated, and durable) compliance.
Structured Query Language (SQL) is the primary interface that you use to communicate with relational databases. The standard American National Standards Institute (ANSI) SQL is supported by all popular relational database engines. Some of these engines have extensions to ANSI SQL to support functionality that is specific to that engine. You use SQL to add, update, or delete data rows; to retrieve subsets of data for transaction processing and analytics applications; and to manage all aspects of the database.
Data integrity is the overall completeness, accuracy, and consistency of data. Relational databases use a set of constraints to enforce data integrity in the database. These include primary keys, foreign keys, NOT NULL constraints, unique constraints, default constraints, and check constraints. These integrity constraints help enforce business rules in the tables to ensure the accuracy and reliability of the data. In addition, most relational databases enable you to embed custom code triggers that execute based on an action on the database.
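A quick sketch of how such constraints reject bad data, again using SQLite with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL, UNIQUE, DEFAULT, and CHECK constraints encode business rules.
conn.execute("""CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    balance    REAL NOT NULL DEFAULT 0 CHECK (balance >= 0)
)""")

conn.execute("INSERT INTO accounts (account_id, email) VALUES (1, 'a@example.com')")

# Violating the CHECK constraint raises an error instead of storing bad data.
try:
    conn.execute("UPDATE accounts SET balance = -50 WHERE account_id = 1")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```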
A database transaction is one or more SQL statements that execute as a sequence of operations to form a single logical unit of work. Transactions provide an all-or-nothing proposition, meaning that the entire transaction must complete as a single unit and be written to the database, or none of the individual components of the transaction will continue. In relational database terminology, a transaction results in a COMMIT or a ROLLBACK. Each transaction is treated in a coherent and reliable way, independent of other transactions.
All database transactions must be atomic, consistent, isolated, and durable (ACID) to ensure data integrity.
Atomicity Atomicity requires that the transaction as a whole executes successfully, or if a part of the transaction fails, then the entire transaction is invalid.
Consistency Consistency mandates that the data written to the database as part of the transaction must adhere to all defined rules and restrictions, including constraints, cascades, and triggers.
Isolation Isolation is critical to achieving concurrency control; it ensures that each transaction runs as though it were the only transaction in the system.
Durability Durability requires that all of the changes made to the database be permanent when a transaction is successfully completed.
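The all-or-nothing behavior described above can be demonstrated with a transaction that fails partway through (SQLite again, with a hypothetical schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balances (name TEXT PRIMARY KEY, amount REAL CHECK (amount >= 0))")
conn.execute("INSERT INTO balances VALUES ('alice', 100), ('bob', 20)")
conn.commit()

# Transfer more than alice has: the second statement violates the CHECK
# constraint, so the entire transaction is rolled back (atomicity).
try:
    with conn:  # the context manager issues COMMIT on success, ROLLBACK on error
        conn.execute("UPDATE balances SET amount = amount + 150 WHERE name = 'bob'")
        conn.execute("UPDATE balances SET amount = amount - 150 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass

# Neither change persisted: alice is still at 100, bob at 20.
print(dict(conn.execute("SELECT name, amount FROM balances")))
```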
Managed database services on AWS, such as Amazon RDS, enable you to offload the administrative burdens of operating and scaling distributed databases to AWS so that you don’t have to worry about the following tasks:
AWS provides a number of database alternatives for developers. As a managed database, Amazon RDS enables you to run a fully featured relational database while off-loading database administration. By contrast, you can run unmanaged databases on Amazon EC2, which gives you more flexibility on the types of databases that you can deploy and configure; however, you are responsible for the administration of the unmanaged databases.
With Amazon Relational Database Service (Amazon RDS), you can set up, operate, and scale a relational database in the AWS Cloud. It provides cost-efficient, resizable capacity for open-standard relational database engines. Amazon RDS is easy to administer, and you do not need to install the database software. Amazon RDS manages time-consuming database administration tasks, which frees you up to focus on your applications and business. For example, Amazon RDS automatically patches the database software and backs up your database. The Amazon RDS managed relational database service works with the popular database engines depicted in Figure 4.1.
Amazon RDS assumes many of the difficult or tedious management tasks of a relational database:
Procurement, configuration, and backup tasks
Security and availability
When you host databases on Amazon RDS, AWS is responsible for the items in Figure 4.2.
Amazon RDS provides six familiar database engines: Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL, and MariaDB. Because Amazon RDS is a managed service, you gain a number of benefits and features built right into the Amazon RDS service. These features include, but are not limited to, the following:
To create an Amazon RDS instance, you can run the following command from the AWS CLI:
aws rds create-db-instance \
    --db-instance-class db.t2.micro \
    --allocated-storage 30 \
    --db-instance-identifier my-cool-rds-db \
    --engine mysql \
    --master-username masteruser \
    --master-user-password masterpassword1!
Depending on the configurations chosen, the database can take several minutes before it is active and ready for use. You can monitor the Amazon RDS Databases console to view the status. When the status states Available, it is ready to be used, as shown in Figure 4.3.
Periodically, Amazon RDS performs maintenance on Amazon RDS resources. Maintenance mostly involves patching the database instance's underlying operating system (OS) or updating the database engine version. Because this is a managed service, Amazon RDS handles the patching for you.
When you create an Amazon RDS database instance, you can define a maintenance window: a period of time during which you want Amazon RDS to apply any updates or downtime to your database instance. You can also enable the automatic minor version upgrade feature, which automatically applies new minor versions of the database engine as they are released (see Figure 4.4).
You can select a maintenance window by using the AWS Management Console, AWS CLI, or Amazon RDS API. After selecting the maintenance window, the Amazon RDS instance is upgraded (if upgrades are available) during that time window. You can also modify the maintenance window by running the following AWS CLI command:
aws rds modify-db-instance --db-instance-identifier your-db-instance-identifier --preferred-maintenance-window Mon:07:00-Mon:07:30
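The window value uses the format ddd:hh24:mi-ddd:hh24:mi (in UTC) and must span at least 30 minutes. A small validator for that format, written purely for illustration, might look like this:

```python
# Compute the length of a maintenance window such as "Mon:07:00-Mon:07:30".
# The function name and approach are invented for this sketch.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def window_minutes(window: str) -> int:
    """Return the length of a ddd:hh24:mi-ddd:hh24:mi window in minutes."""
    start, end = window.split("-")

    def to_minutes(stamp: str) -> int:
        day, hour, minute = stamp.split(":")
        return DAYS.index(day) * 1440 + int(hour) * 60 + int(minute)

    # Modulo a full week handles windows that wrap past Sunday night.
    return (to_minutes(end) - to_minutes(start)) % (7 * 1440)

print(window_minutes("Mon:07:00-Mon:07:30"))  # 30
```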
If your database needs to handle a bigger load, you can vertically scale your Amazon RDS instance. At the time of this writing, there are 40 available DB instance classes, which enable you to choose the number of virtual CPUs and memory available. This gives you flexibility over the performance and cost of your Amazon RDS database. To scale the Amazon RDS instance, you can use the console, AWS CLI, or AWS SDK.
If you are in a Single-AZ configuration for your Amazon RDS instance, the database is unavailable during the scaling operation. However, if you are in a Multi-AZ configuration, the standby database is upgraded first and then a failover occurs to the newly configured database. You can also apply the change during the next maintenance window. This way, your upgrade can occur during your normal outage windows.
To scale the Amazon RDS database by using the AWS CLI, run the following command:
aws rds modify-db-instance --db-instance-identifier your-db-instance-identifier --db-instance-class db.t2.medium
Storage is a critical component for any database. Amazon RDS has the following three storage types:
General Purpose SSD (gp2) This storage type is for cost-effective storage that is ideal for a broad range of workloads. Gp2 volumes deliver single-digit millisecond latencies and the ability to burst to 3,000 IOPS for extended periods of time. The volume’s size determines the performance of gp2 volumes.
Provisioned IOPS (io1) This storage type is for input/output-intensive workloads that require low input/output (I/O) latency and consistent I/O throughput.
Magnetic Storage This storage type is designed for backward compatibility, and AWS recommends that you use General Purpose SSD or Provisioned IOPS for any new Amazon RDS workloads.
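Because gp2 performance scales with volume size, it can help to estimate the baseline before provisioning. The following sketch assumes the commonly documented gp2 formula (3 IOPS per GiB, with a 100 IOPS floor and a 16,000 IOPS ceiling at the time of this writing); verify current limits in the AWS documentation:

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Estimated baseline IOPS for a gp2 volume: 3 IOPS per GiB,
    floored at 100 and capped at 16,000 (assumed formula)."""
    return min(max(100, 3 * size_gib), 16000)

for size in (20, 100, 1000, 6000):
    print(size, "GiB ->", gp2_baseline_iops(size), "IOPS")
```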
To scale your storage, you must modify the Amazon RDS DB instance by executing the following AWS CLI command:
aws rds modify-db-instance --db-instance-identifier your-db-instance-identifier --allocated-storage 50 --storage-type io1 --iops 3000
This command modifies your storage to 50 GB in size, with a Provisioned IOPS storage drive and a dedicated IOPS of 3000. While modifying the Amazon RDS DB instance, consider the potential downtime.
There are two ways to scale your database tier with Amazon RDS: vertical scaling and horizontal scaling. Vertical scaling increases the amount of memory and vCPUs allocated to the primary database. Alternatively, use horizontal scaling (adding another server to your database tier) to improve the performance of applications that are read-heavy as opposed to write-heavy.
Read replicas create read-only copies of your master database, which allow you to offload any reads (or SQL SELECT statements) to the read replica. The replication from the master database to the read replica is asynchronous. As a result, the data queried from the read replica is not the latest data. If your application requires strongly consistent reads, consider an alternative option.
At the time of this writing, Amazon RDS MySQL, PostgreSQL, and MariaDB support up to five read replicas, and Amazon Aurora supports up to 15 read replicas. Microsoft SQL Server and Oracle do not support read replicas.
To create a read replica by using AWS CLI, run the following command:
aws rds create-db-instance-read-replica --db-instance-identifier your-db-instance-identifier --source-db-instance-identifier your-source-db
Amazon RDS has two different ways of backing up data of your database instance: automated backups and database snapshots (DB snapshots).
With Amazon RDS, automated backups offer point-in-time recovery of your database. When enabled, Amazon RDS performs a full daily snapshot of your data during your preferred backup window. After each daily backup is taken, Amazon RDS captures transaction logs as changes are made to the database.
After you initiate a point-in-time recovery, the transaction logs are applied to the most appropriate daily backup to restore your database instance. You can restore to any specific second, as long as it is within your retention period. The default retention period is seven days, and the maximum is 35 days.
To perform a restore, you must choose a time at or before the Latest Restorable Time, which is typically within the last 5 minutes. For example, suppose that the current time is February 14 at 10 p.m., so the Latest Restorable Time is no later than February 14 at 9:55 p.m. A point-in-time restore to February 14 at 9 p.m. would succeed, but a restore to February 14 at 9:58 p.m. would fail, because that time falls inside the 5-minute window.
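The arithmetic in that example can be sketched as follows; the five-minute lag and the dates are illustrative assumptions, not fixed guarantees:

```python
from datetime import datetime, timedelta

def restore_would_succeed(requested: datetime, now: datetime,
                          lag: timedelta = timedelta(minutes=5)) -> bool:
    """A restore target must be at or before the Latest Restorable Time,
    which trails the current time by roughly five minutes (assumed here)."""
    latest_restorable = now - lag
    return requested <= latest_restorable

now = datetime(2019, 2, 14, 22, 0)                                # Feb 14, 10 p.m.
print(restore_would_succeed(datetime(2019, 2, 14, 21, 0), now))   # True
print(restore_would_succeed(datetime(2019, 2, 14, 21, 58), now))  # False
```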
Automated backups are kept until the source database is deleted. After the source Amazon RDS instance is removed, the automated backups are also removed.
Unlike automated backups, database snapshots with Amazon RDS are user-initiated and enable you to back up your database instance in a known state at any time. You can also restore to that specific snapshot at any time.
Similar to the other Amazon RDS features, you can create the snapshots through the AWS Management Console, with the CreateDBSnapshot API, or with the AWS CLI.
With DB snapshots, the backups are kept until you explicitly delete them; therefore, take a final snapshot before removing any Amazon RDS instance. Regardless of the backup type, storage I/O may be briefly suspended while the backup process initializes (typically a few seconds), and you may experience a brief period of elevated latency. You can avoid these suspensions by deploying in a Multi-AZ configuration, where the backup is taken from the standby instead of the primary database.
To create a snapshot of the database, from the Amazon RDS Databases console, under Actions, select the Take snapshot option (see Figure 4.5). After a snapshot is taken, you can view all of your snapshots from the Snapshots console.
By using Amazon RDS, you can run in a Multi-AZ configuration. In a Multi-AZ configuration, you have a primary and a standby DB instance. Updates to the primary database replicate synchronously to the standby replica in a different Availability Zone. The primary benefit of Multi-AZ is realized during certain types of planned maintenance, or in the unlikely event of a DB instance failure or an Availability Zone failure. Amazon RDS automatically fails over to the standby so that you can resume your workload as soon as the standby is promoted to the primary. This means that you can reduce your downtime in the event of a failure.
Because Amazon RDS is a managed service, Amazon RDS handles the failover to the standby. When there is a DB instance failure, Amazon RDS automatically promotes the standby to the primary—you will not interact with the standby directly. In other words, you will receive one endpoint connection for the Amazon RDS cluster, and Amazon RDS handles the failover.
Amazon RDS Multi-AZ configuration provides the following benefits:
To create an Amazon RDS instance in a Multi-AZ configuration, you must specify a subnet group that has two different Availability Zones specified. You can specify a Multi-AZ configuration by using AWS CLI by adding the --multi-az flag to the AWS CLI command, as follows:
aws rds create-db-instance \
    --db-instance-class db.t2.micro \
    --allocated-storage 30 \
    --db-instance-identifier multi-az-rds-db \
    --engine mysql \
    --master-username masteruser \
    --master-user-password masterpassword1! \
    --multi-az
For encryption at rest, Amazon RDS uses the AWS Key Management Service (AWS KMS) for AES-256 encryption. You can use a default master key or specify your own for the Amazon RDS DB instance. Encryption is one of the few options that must be configured when the DB instance is created. You cannot modify an Amazon RDS database to enable encryption. You can, however, create a DB snapshot and then restore to an encrypted DB instance or cluster.
Amazon RDS supports using the Transparent Data Encryption (TDE) for Oracle and SQL Server. For more information on TDE with Oracle and Microsoft SQL Server, see the following:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.SQLServer.Options.TDE.html
At the time of this writing, the following Amazon RDS DB instance types are not supported for encryption at rest:
For encryption in transit, Amazon RDS generates an SSL certificate for each database instance that can be used to connect your application and the Amazon RDS instance. However, encryption is a compute-intensive operation that increases the latency of your database connection. For more information, see the documentation for the specific database engine.
You can authenticate to your DB instance by using IAM. By using IAM, you can manage access to your database resources centrally instead of storing the user credentials in each database. The IAM feature also encrypts network traffic to and from the database by using SSL.
IAM DB authentication is supported only for MySQL and PostgreSQL. At the time of this writing, the following MySQL versions are supported:
There’s no support for the following:
The following PostgreSQL versions are supported:
To enable IAM DB authentication for your Amazon RDS instance, run the following command:
aws rds modify-db-instance --db-instance-identifier my-rds-db --enable-iam-database-authentication --apply-immediately
Because downtime is associated with this action, you can instead defer the change to the next maintenance window by changing the last parameter to --no-apply-immediately.
Use Amazon CloudWatch to monitor your database tier. You can create alarms to notify database administrators when there is a failure.
By default, CloudWatch provides some built-in metrics for Amazon RDS with a granularity of 5 minutes (300 seconds). If you want to gather metrics at a finer granularity, such as 1 second, enable enhanced monitoring, which is similar to how you enable these features in Amazon EC2.
To view all the Amazon RDS metrics that are provided through CloudWatch, select the Monitoring tab from the Amazon RDS console (see Figure 4.6).
Amazon RDS integrates with CloudWatch to send it the following database logs:
From the Amazon RDS console, select the Logs & events tab to view and download the specified logs, as shown in Figure 4.7.
For more information on CloudWatch and its capabilities across other AWS services, see Chapter 15, “Monitoring and Troubleshooting.”
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
Aurora is part of the managed database service Amazon RDS.
Aurora is a drop-in replacement for MySQL and PostgreSQL relational databases. It is built for the cloud, and it combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. You can use the code, tools, and applications that you use today with your existing MySQL and PostgreSQL databases with Aurora.
The integration of Aurora with Amazon RDS means that time-consuming administration tasks, such as hardware provisioning, database setup, patching, and backups, are automated.
Aurora features a distributed, fault-tolerant, self-healing storage system that automatically scales up to 64 TiB per database instance. (In comparison, other Amazon RDS options allow for a maximum of 32 TiB.) Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon Simple Storage Service (Amazon S3), and replication across three Availability Zones. When you create an Aurora instance, you create a DB cluster. A DB cluster consists of one or more DB instances and a cluster volume that manages the data for those instances. An Aurora cluster volume is a virtual database storage volume that spans multiple Availability Zones, and each Availability Zone has a copy of the DB cluster data.
An Aurora DB cluster has two types of DB instances:
Primary Instance Supports read and write operations and performs all of the data modifications to the cluster volume. Each Aurora DB cluster has one primary instance.
Amazon Aurora Replica Supports read-only operations. Each Aurora DB cluster can have up to 15 Amazon Aurora Replicas in addition to the primary instance. Multiple Aurora Replicas distribute the read workload, and if you locate Aurora Replicas in separate Availability Zones, you can also increase database availability.
Figure 4.8 illustrates the relationship between the cluster volume, the primary instance, and Aurora Replicas in an Aurora DB cluster.
As you can see from Figure 4.8, this architecture is vastly different from the other Amazon RDS databases. Aurora is engineered and architected for the cloud. The primary difference is that there is a separate storage layer, called the cluster volume, which is spread across multiple Availability Zones in a single AWS Region. This means that the durability of your data is increased.
Additionally, Aurora has one primary instance that writes across the cluster volume. This means that Aurora replicas can be spun up quickly, because they don’t have to copy and store their own storage layer; they connect to it. Because the cluster volume is separated in this architecture, the cluster volume can grow automatically as your data increases. This is in contrast to how other Amazon RDS databases are built, whereby you must define the allocated storage in advance.
With Aurora, you can also create a multiregional deployment for your database tier. In this configuration, the primary AWS Region is where your data is written (you may also do reads from the primary AWS Region). Any application performing writes must write to the primary AWS Region where the cluster is operating.
The secondary AWS Region is used for reading data only. Aurora replicates the data to the secondary AWS Region with typical latency of less than a second. Furthermore, you can use the secondary AWS Region for disaster recovery purposes. You can promote the secondary cluster and make it available as the primary typically in less than a minute. At the time of this writing, Aurora global databases are available in the following AWS Regions only:
Additionally, at the time of this writing, Aurora global databases are available only for MySQL 5.6.
Aurora Serverless is an on-demand, automatic scaling configuration for Aurora. (It is available only for MySQL at the time of this writing.) With Aurora Serverless, the database will automatically start up, shut down, and scale capacity up or down based on your application’s needs. This means that, as a developer, you can run your database in the AWS Cloud and not worry about managing any database instances.
The following are best practices for working with Amazon RDS:
Follow Amazon RDS basic operational guidelines. The Amazon RDS Service Level Agreement requires that you follow these guidelines:
Allocate sufficient RAM to the DB instance. An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. Check the ReadIOPS metric by using CloudWatch while the DB instance is under load to view the working set. The value of ReadIOPS should be small and stable. Scale up the DB instance class until ReadIOPS no longer drops dramatically after a scaling operation or when ReadIOPS is reduced to a small amount.
Implement Amazon RDS security. Use IAM accounts to control access to Amazon RDS API actions, especially actions that create, modify, or delete Amazon RDS resources, such as DB instances, security groups, option groups, or parameter groups, and actions that perform common administrative actions, such as backing up and restoring DB instances, or configuring Provisioned IOPS storage.
Use the AWS Management Console, the AWS CLI, or the Amazon RDS API to change the password for your master user. If you use another tool, such as a SQL client, to change the master user password, it might result in permissions being revoked for the user unintentionally.
Use enhanced monitoring to identify OS issues. Amazon RDS provides metrics in real time for the OS on which your DB instance runs. You can view the metrics for your DB instance by using the console or consume the Enhanced Monitoring JSON output from CloudWatch Logs in a monitoring system of your choice. Enhanced Monitoring is available for the following database engines:
Enhanced Monitoring is available for all DB instance classes except for db.m1.small. Enhanced Monitoring is available in all regions except for AWS GovCloud (US).
Use metrics to identify performance issues. To identify performance issues caused by insufficient resources and other common bottlenecks, you can monitor the metrics available for your Amazon RDS DB instance.
Monitor performance metrics regularly to see the average, maximum, and minimum values for a variety of time ranges. If you do so, you can identify when performance is degraded. You can also set CloudWatch alarms for particular metric thresholds.
To troubleshoot performance issues, it’s important to understand the baseline performance of the system. When you set up a new DB instance and get it running with a typical workload, you should capture the average, maximum, and minimum values of all the performance metrics at a number of different intervals (for example, 1 hour, 24 hours, 1 week, or 2 weeks) to get an idea of what is normal. It helps to get comparisons for both peak and off-peak hours of operation. You can then use this information to identify when performance is dropping below standard levels.
Tune queries. One of the best ways to improve DB instance performance is to tune your most commonly used and most resource-intensive queries to make them less expensive to run.
A common aspect of query tuning is creating effective indexes. You can use the Database Engine Tuning Advisor to get potential index improvements for your DB instance.
Use DB parameter groups. AWS recommends that you apply changes to the DB parameter group on a test DB instance before you apply parameter group changes to your production DB instances. Improperly setting DB engine parameters in a DB parameter group can have unintended adverse effects, including degraded performance and system instability. Always exercise caution when modifying DB engine parameters, and back up your DB instance before modifying a DB parameter group.
Use read replicas. Use read replicas to relieve pressure on your master node with additional read capacity. You can bring your data closer to applications in different regions and promote a read replica to a master for faster recovery in the event of a disaster.
You can use the AWS Database Migration Service (AWS DMS) to migrate or replicate your existing databases easily to Amazon RDS.
Nonrelational databases are commonly used for internet-scale applications that do not require any complex queries.
NoSQL databases are nonrelational databases optimized for scalable performance and schema-less data models. NoSQL databases are also widely recognized for their ease of development, low latency, and resilience.
NoSQL database systems use a variety of models for data management, such as in-memory key-value stores, graph data models, and document stores. These types of databases are optimized for applications that require large data volume, low latency, and flexible data models, which are achieved by relaxing some of the data consistency restrictions of traditional relational databases.
NoSQL databases are a great fit for many big data, mobile, and web applications that require greater scale and higher responsiveness than traditional relational databases. Because of simpler data structures and horizontal scaling, NoSQL databases typically respond faster and are easier to scale than relational databases.
Many developers are familiar with SQL databases but might be new to working with NoSQL databases. Relational database management systems (RDBMS) and nonrelational (NoSQL) databases have different strengths and weaknesses. In a RDBMS, data can be queried flexibly, but queries are relatively expensive and do not scale well in high-traffic situations. In a NoSQL database, you can query data efficiently in a limited number of ways. Table 4.3 shows a comparison of different characteristics of SQL and NoSQL databases.
Table 4.3 SQL vs. NoSQL Database Characteristics
Type | SQL | NoSQL |
Data Storage | Rows and columns | Key-value, document, wide-column, graph |
Schemas | Fixed | Dynamic |
Querying | Using SQL | Focused on collections of documents |
Scalability | Vertical | Horizontal |
Transactions | Supported | Support varies |
Consistency | Strong | Eventual and strong |
The storage format for SQL versus NoSQL databases also differs. As shown in Figure 4.9, SQL databases are often stored in a row and column format, whereas NoSQL databases, such as Amazon DynamoDB, have a key-value format that could be in a JSON format, as shown in this example.
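As a hedged illustration of the key-value format (the attribute names and values below are invented for this example, not from a real table), a single DynamoDB item can be represented as a JSON document:

```python
import json

# A hypothetical DynamoDB item: one self-describing key-value document
# per item, rather than a fixed row of columns.
item = {
    "PersonID": "000-07-1075",
    "FirstName": "Mary",
    "LastName": "Major",
    "Hobbies": ["reading", "cycling"],  # attributes can hold lists...
    "Address": {                        # ...or nested documents
        "City": "Seattle",
        "State": "WA",
    },
}

print(json.dumps(item, indent=2))
```

Note that a second item in the same table could carry a completely different set of attributes; only the key attributes are required, which is what "dynamic schema" means in Table 4.3.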
There are four types of NoSQL databases: columnar, document, graph, and in-memory key-value. Generally, these databases differ in how the data is stored, accessed, and structured, and they are optimized for different use cases and applications.
Columnar databases Columnar databases are optimized for reading and writing columns of data as opposed to rows of data. Column-oriented storage for database tables is an important factor in analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data that you must load from disk.
Document databases Document databases are designed to store semi-structured data as documents, typically in JSON or XML format. Unlike traditional relational databases, the schema for each NoSQL document can vary, giving you more flexibility in organizing and storing application data and reducing storage required for optional values.
Graph databases Graph databases store vertices and directed links called edges. Graph databases can be built on both SQL and NoSQL databases. Vertices and edges can each have properties associated with them.
In-memory key-value stores In-memory key-value stores are NoSQL databases optimized for read-heavy application workloads (such as social networking, gaming, media sharing, and Q&A portals) or compute-intensive workloads (such as a recommendation engine). In-memory caching improves application performance by storing critical pieces of data in memory for low-latency access.
Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed cloud database, and it supports both document and key-value store models. Its flexible data model, reliable performance, and automatic scaling of throughput capacity make it a great fit for the following:
With DynamoDB, you can create database tables that can store and retrieve any amount of data and serve any level of request traffic. You can scale up or scale down your table throughput capacity without downtime or performance degradation. DynamoDB automatically spreads the data and traffic for your tables over a sufficient number of servers to handle your throughput and storage requirements while maintaining consistent and fast performance. All of your data is stored on solid-state drives (SSDs) and automatically replicated across multiple Availability Zones in an AWS Region, providing built-in high availability and data durability. You can use global tables to keep DynamoDB tables in sync across AWS Regions.
In DynamoDB, tables, items, and attributes are the core components with which you work. A table is a collection of items, and each item is a collection of attributes. DynamoDB uses partition keys to uniquely identify each item in a table. Secondary indexes can be used to provide more querying flexibility. You can use DynamoDB Streams to capture data modification events in DynamoDB tables.
Figure 4.10 shows the DynamoDB data model, including a table, items, attributes, a required partition key, an optional sort key, and an example of data being stored in partitions.
Similar to other database systems, DynamoDB stores data in tables. A table is a collection of items. For example, a table called People could be used to store personal contact information about friends, family, or anyone else of interest.
An item in DynamoDB is similar in many ways to rows, records, or tuples in other database systems. Each DynamoDB table contains zero or more items. An item is a collection of attributes that is uniquely identifiable for each record in that table. For a People table, each item represents a person. There is no limit to the number of items that you can store in a table.
Each item is composed of one or more attributes. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems. An attribute is a fundamental data element, something that does not need to be broken down any further. You can think of an attribute as similar to columns in a relational database. For example, an item in a People table contains attributes called PersonID, Last Name, First Name, and so on.
Figure 4.11 shows a table named People with items and attributes. Each block represents an item, and within those blocks you have attributes that define the overall item:
When you create a table, at a minimum, you are required to specify the table name and primary key of the table. The primary key uniquely identifies each item in the table. No two items can have the same key within a table.
DynamoDB supports two different kinds of primary keys: a partition key alone (simple primary key) or a partition key combined with a sort key (composite primary key).
Partition key (hash key) A simple primary key, composed of one attribute, is known as the partition key. DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item is stored.
In a table that has only a partition key, no two items can have the same partition key value. For example, in the People table, with a simple primary key of PersonID, you cannot have two items with PersonID of 000-07-1075.
The partition key of an item is also known as its hash attribute. The term hash attribute derives from the use of an internal hash function in DynamoDB that evenly distributes data items across partitions based on their partition key values.
Each primary key attribute must be a scalar (meaning that it can hold only a single value). The only data types allowed for primary key attributes are string, number, or binary. There are no such restrictions for other, nonkey attributes.
Partition key and sort key (range attribute) A composite primary key is composed of two attributes: partition key and the sort key.
The sort key of an item is also known as its range attribute. The term range attribute derives from the way that DynamoDB stores items with the same partition key physically close together, in sorted order, by the sort key value.
The partition key is used just as it is in a simple primary key, but in addition, items that have the same partition key value are stored together, in sorted order, by sort key value.
In a table that has a partition key and a sort key, it’s possible for two items to have the same partition key value, but those two items must have different sort key values. You cannot have two items in the table that have identical partition key and sort key values.
For example, if you have a Music table with a composite primary key (Artist and SongTitle), you can access any item in the Music table directly if you provide the Artist and SongTitle values for that item.
A composite primary key gives you additional flexibility when querying data. For example, if you provide only the value for Artist, DynamoDB retrieves all of the songs by that artist. To retrieve only a subset of songs by a particular artist, you can provide a value for Artist with a range of values for SongTitle.
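To make the composite-key behavior concrete, here is a small pure-Python sketch (no AWS calls; the song data is invented) that mimics how items sharing a partition key are stored together in sorted order, so that a query by Artist alone, or by Artist plus a SongTitle range, is cheap:

```python
from collections import defaultdict

# storage[partition_key] holds (sort_key, item) pairs kept in sorted order,
# mimicking how DynamoDB co-locates items that share a partition key.
storage = defaultdict(list)

def put_item(artist, song_title, item):
    storage[artist].append((song_title, item))
    storage[artist].sort(key=lambda pair: pair[0])  # sorted by sort key

def query(artist, title_range=None):
    """All songs by an artist, optionally limited to a sort-key range."""
    items = storage[artist]
    if title_range:
        lo, hi = title_range
        items = [(t, i) for t, i in items if lo <= t <= hi]
    return [i for _, i in items]

put_item("Sam Samuel", "Blue Skies", {"Genre": "Jazz"})
put_item("Sam Samuel", "Angel Dust", {"Genre": "Rock"})

# Querying by partition key alone returns every song by that artist,
# already sorted by SongTitle.
print(query("Sam Samuel"))
```

This is only a sketch of the access pattern, but it shows why reads scoped to one partition key value are efficient: all matching items live together and are already ordered.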
As a developer, the partition key attribute you choose for your application has important implications. If there is little differentiation among partition key values, all of your data is stored in the same partition, in the same physical location.
Figure 4.12 shows an example of these two types of keys. In the SensorLocation table, the primary key is the SensorId attribute. This means that every item (or row) in this table has a unique SensorId, meaning that each sensor has exactly one location or latitude and longitude value.
Conversely, the SensorReadings table has a partition key and a sort key. The SensorId attribute is the partition key and the Time attribute is the sort key, which combined make it a composite key. For each SensorId, there may be multiple items corresponding to sensor readings at different times. The combination of SensorId and Time uniquely identifies items in the table. This design enables you to query the table for all readings related to a particular sensor.
If you want to perform queries on attributes that are not part of the table’s primary key, you can create a secondary index. By using a secondary index, you can query the data in the table by using an alternate key, in addition to querying against the primary key. DynamoDB does not require that you use indexes, but doing so may give you more flexibility when querying your data depending on your application and table design.
After you create a secondary index on a table, you can read data from the index in much the same way as you do from the table. DynamoDB automatically maintains your secondary indexes, updating every index whenever the base table changes.
A secondary index contains the following:
DynamoDB provides fast access to the items in a table by specifying primary key values. However, many applications might benefit from having one or more secondary (or alternate) keys available. This allows efficient access to data with attributes other than the primary key.
DynamoDB supports two types of secondary indexes: local secondary indexes and global secondary indexes. You can define up to five global secondary indexes and five local secondary indexes per table.
A local secondary index is an index that has the same partition key as the base table, but a different sort key (see Figure 4.13). It is “local” in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value. You can create local secondary indexes only while creating the table; you cannot add, remove, or modify them later.
A global secondary index is an index with a partition key and a sort key that can be different from those on the base table (see Figure 4.14). It is considered “global” because queries on the index can span all of the data in the base table across all partitions. You can create one during table creation, and you can add, remove, or modify it later.
You can create a global secondary index, not a local secondary index, after table creation.
For example, by using a Music table, you can query data items by Artist (partition key) or by Artist and SongTitle (partition key and sort key). Suppose that you also wanted to query the data by Genre and Album Title. To do this, you could create a global secondary index on Genre and AlbumTitle and then query the index in much the same way as you’d query the Music table.
Figure 4.15 shows the example Music table with a new index called GenreAlbumTitle. In the index, Genre is the partition key, and AlbumTitle is the sort key.
Note the following about the GenreAlbumTitle index:
You can query the GenreAlbumTitle index to find all albums of a particular genre (for example, all Hard Rock albums). You can also query the index to find all albums within a particular genre that have certain album titles (for example, all Heavy Metal albums with titles that start with the letter M).
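A sketch of what such an index query can look like with the low-level API's string form of KeyConditionExpression (the helper below only assembles the request parameters; the GenreAlbumTitle names match the example, but the helper function itself is hypothetical):

```python
def build_genre_query(genre, title_prefix=None):
    """Build Query parameters for the GenreAlbumTitle global secondary
    index. With boto3 you would pass these to client.query(**params)."""
    params = {
        "TableName": "Music",
        "IndexName": "GenreAlbumTitle",
        "KeyConditionExpression": "Genre = :g",
        "ExpressionAttributeValues": {":g": {"S": genre}},
    }
    if title_prefix:
        # Narrow the result to album titles starting with the given prefix.
        params["KeyConditionExpression"] += " AND begins_with(AlbumTitle, :t)"
        params["ExpressionAttributeValues"][":t"] = {"S": title_prefix}
    return params

# All Heavy Metal albums with titles that start with the letter M:
print(build_genre_query("Heavy Metal", "M"))
```

The shape is identical to querying the base table; only IndexName and the key attribute names change.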
To determine which type of index to use, consider your application’s requirements. Table 4.4 shows the main differences between a global secondary index and a local secondary index.
Table 4.4 Comparison of Local and Global Secondary Indexes
Characteristic | Global Secondary Index | Local Secondary Index |
Query Scope | Entire table, across all partitions. | Single partition, as specified by the partition key value in the query. |
Key Attributes | Partition key, or partition key and sort key; these can be any scalar attributes in the base table. | Same partition key as the base table, with a different sort key. |
Projected Attributes | Only projected attributes can be queried. | Can query attributes that are not projected. Attributes are retrieved from the base table. |
Read Consistency | Eventual consistency only. | Eventual consistency or strong consistency. |
Provisioned Throughput | Has its own provisioned throughput settings, separate from those of the base table. | Shares the provisioned throughput of the base table. |
Lifecycle Considerations | Can be created or deleted at any time. | Must be created when the table is created; cannot be added or removed later. |
Amazon DynamoDB Streams is an optional feature that captures data modification events in DynamoDB tables. The data about these events appears in the stream in near real time and in the order that the events occurred.
Each event is represented by a stream record. If you enable a stream on a table, DynamoDB Streams writes a stream record whenever one of the following events occurs:
Each stream record also contains the name of the table, the event timestamp, and other metadata. Stream records have a lifetime of 24 hours; after that, they are automatically removed from the stream.
Figure 4.16 shows how you can use DynamoDB Streams together with AWS Lambda to create a trigger—code that executes automatically whenever an event of interest appears in a stream. For example, consider a Customers table that contains customer information for a company. Suppose that you want to send a “welcome” email to each new customer. You could enable a stream on that table and then associate the stream with a Lambda function. The Lambda function would execute whenever a new stream record appears, but only process new items added to the Customers table. For any item that has an EmailAddress attribute, the Lambda function could invoke Amazon Simple Email Service (Amazon SES) to send an email to that address.
In the example shown in Figure 4.16, the last customer, Craig Roe, will not receive an email because he does not have an EmailAddress.
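The handler logic just described can be sketched in pure Python (no AWS calls; the send_welcome_email helper is a stand-in for the Amazon SES call, and the sample event is invented to match the stream record format):

```python
def send_welcome_email(address):
    # Stand-in for an Amazon SES call (e.g., ses.send_email(...)).
    print("Sending welcome email to", address)

def handler(event, context=None):
    """Process DynamoDB Streams records: email only newly inserted
    customers that actually have an EmailAddress attribute."""
    emailed = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore MODIFY and REMOVE events
        new_image = record["dynamodb"].get("NewImage", {})
        email_attr = new_image.get("EmailAddress")
        if email_attr:
            address = email_attr["S"]  # stream images use typed attributes
            send_welcome_email(address)
            emailed.append(address)
    return emailed

# Example stream event: one customer without an email, one with
event = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"CustomerName": {"S": "Craig Roe"}}}},
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"CustomerName": {"S": "Jane Doe"},
                               "EmailAddress": {"S": "jane@example.com"}}}},
]}
print(handler(event))
```

Running the handler against this event emails only the second customer, mirroring the Craig Roe case from Figure 4.16.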
In addition to triggers, DynamoDB Streams enables other powerful solutions that developers can create, such as the following:
DynamoDB replicates data among multiple Availability Zones in a region. When your application writes data to a DynamoDB table and receives an HTTP 200 response (OK), all copies of the data are updated. The data is eventually consistent across all storage locations, usually within 1 second or less. DynamoDB supports both eventually consistent and strongly consistent reads.
When you read data from a DynamoDB table immediately after a write operation, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data. DynamoDB uses eventually consistent reads, unless you specify otherwise.
When querying data, you can specify whether DynamoDB should return strongly consistent reads. When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting updates from all prior write operations that were successful. A strongly consistent read might not be available if there is a network delay or outage.
As a developer, it is important to understand the needs of your application. In some applications, eventually consistent reads might be fine, such as a high-score dashboard. In other applications or parts of an application, however, such as a financial or medical system, an eventually consistent read could be an issue. You will want to evaluate your data usage patterns to ensure that you are choosing the right type of reads for each part of your application.
There is an additional cost for strongly consistent reads, and they will have more latency in returning data than an eventually consistent read. So, that cost and timing should also play into your decision.
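In practice the choice is a single request parameter. The helper below only builds GetItem parameters (the table and key names are illustrative, and the helper itself is hypothetical):

```python
def build_get_item(table, key, strongly_consistent=False):
    """Build GetItem parameters. ConsistentRead=True requests a strongly
    consistent read: no stale data, but higher cost and latency."""
    return {
        "TableName": table,
        "Key": key,
        "ConsistentRead": strongly_consistent,
    }

# An eventually consistent read is the default, fine for a high-score board...
print(build_get_item("HighScores", {"PlayerId": {"S": "p-100"}}))

# ...while a financial record might warrant a strongly consistent read.
print(build_get_item("Accounts", {"AccountId": {"S": "a-200"}},
                     strongly_consistent=True))
```

With boto3 you would pass these parameters to client.get_item(**params); the same ConsistentRead flag exists on Query and Scan as well.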
When you create a table or index in DynamoDB, you must specify your capacity requirements for read and write activity. By defining your throughput capacity in advance, DynamoDB can reserve the necessary resources to meet the read and write activity your application requires, while ensuring consistent, low-latency performance. Specify your required throughput value by setting the ProvisionedThroughput parameter when you create or update a table.
You specify throughput capacity in terms of read capacity units and write capacity units:
For example, suppose that you create a table with five read capacity units and five write capacity units. With these settings, your application could do the following:
If your application reads or writes larger items (up to the DynamoDB maximum item size of 400 KB), it consumes more capacity units.
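The arithmetic behind capacity units can be sketched as follows (the rounding rules reflect DynamoDB's documented 4 KB read unit and 1 KB write unit sizes):

```python
import math

def read_units(item_size_kb, strongly_consistent=True):
    """One read capacity unit = one strongly consistent read per second of
    an item up to 4 KB; eventually consistent reads cost half as much."""
    units = math.ceil(item_size_kb / 4)
    return units if strongly_consistent else units / 2

def write_units(item_size_kb):
    """One write capacity unit = one write per second of an item up to 1 KB."""
    return math.ceil(item_size_kb / 1)

# A 6 KB item costs 2 RCUs per strongly consistent read, half that for an
# eventually consistent read, and 6 WCUs per write.
print(read_units(6), read_units(6, strongly_consistent=False), write_units(6))
```

So with five read capacity units and five write capacity units provisioned, your application could, for example, perform five strongly consistent reads per second of items up to 4 KB, or five writes per second of items up to 1 KB.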
If your read or write requests exceed the throughput settings for a table, DynamoDB can throttle those requests. DynamoDB can also throttle requests that exceed the provisioned throughput of an index. Throttling prevents your application from consuming too many capacity units. When a request is throttled, it fails with an HTTP 400 code (Bad Request) and a ProvisionedThroughputExceededException. The AWS SDKs have built-in support for retrying throttled requests, so you do not need to write this logic yourself.
DynamoDB provides the following mechanisms for managing throughput as it changes:
Amazon DynamoDB Auto Scaling DynamoDB automatic scaling actively manages throughput capacity for tables and global secondary indexes. With automatic scaling, you define a range (upper and lower limits) for read and write capacity units. You also define a target utilization percentage within that range. DynamoDB auto scaling seeks to maintain your target utilization, even as your application workload increases or decreases.
Provisioned throughput If you aren’t using DynamoDB auto scaling, you have to define your throughput requirements manually. As discussed, with this setting you may run into a ProvisionedThroughputExceededException if you are throttled. But you can change your throughput with a few clicks.
Reserved capacity You can purchase reserved capacity in advance, where you pay a one-time upfront fee and commit to a minimum usage level over a period of time. You may realize significant cost savings compared to on-demand provisioned throughput settings.
On-demand It can be difficult to plan capacity, especially if you aren’t collecting metrics or perhaps are developing a new application and you aren’t sure what type of performance you require. With On-Demand mode, your DynamoDB table will automatically scale up or down to any previously reached traffic level. If a workload’s traffic level reaches a new peak, DynamoDB rapidly adapts to accommodate the workload. As a developer, focus on making improvements to your application and offload scaling activities to AWS.
When you are using a table in DynamoDB, the data is placed on multiple partitions (depending on the amount of data and the amount of throughput allocated to it; recall that throughput is determined by RCUs and WCUs). When you allocate RCUs and WCUs to a table, those RCUs and WCUs are split evenly among all partitions for your table.
For example, suppose that you have allocated 1,000 RCUs and 1,000 WCUs to a table, and this table has 10 partitions allocated to it. Then each partition would have 100 RCUs and 100 WCUs for it to use. If one of your partitions consumes all the RCUs and WCUs for the table, you may receive a ProvisionedThroughputExceededException error because one of your partitions is hot. To deal with hot partitions, DynamoDB has two features: burst capacity and adaptive capacity.
The previous example discussed how you had 10 partitions, each with 100 RCUs and 100 WCUs allocated to them. One of your partitions begins to become hot and now needs to consume more than 100 RCUs. Under normal circumstances, you may receive the ProvisionedThroughputExceededException error. However, with burst capacity, whenever your partition is not using all of its total capacity, DynamoDB reserves a portion of that unused capacity for later bursts of throughput to handle any spike your partition may experience.
At the time of this writing, DynamoDB currently reserves up to 300 seconds (5 minutes) of unused read and write capacity, which means that your partition can handle a peak load for 5 minutes over its normal expected load. Burst capacity is enabled and runs in the background.
Adaptive capacity addresses the situation in which read and write activity cannot be distributed evenly across partitions. In the example, suppose that a partition experiences not only peak demand but also consistent demand over and above its normal 100 RCUs and 100 WCUs, so that it now requires 200 RCUs instead of 100.
DynamoDB adaptive capacity enables your application to continue reading and writing to hot partitions without being throttled, provided that the total provisioned capacity for the table is not exceeded. DynamoDB allocates additional RCUs to the hot partition; in this case, 100 more. However, you will still be throttled for a period of time, typically between 5 and 30 minutes, before adaptive capacity activates. So, for a portion of time, your application will be throttled; after adaptive capacity allocates the additional RCUs to the partition, DynamoDB is able to sustain the new, higher throughput for your partition and table. Adaptive capacity is on by default, and there is no need to enable or disable it.
Two primary methods are used to retrieve data from DynamoDB: Query and Scan.
In DynamoDB, you perform Query operations directly on a table or an index. To run a Query, you must specify, at a minimum, the partition key value to match. If you are querying an index, you must specify both TableName and IndexName.
The following is a query on a Music table in DynamoDB using the Python SDK:
import boto3
import decimal
import json

from boto3.dynamodb.conditions import Key

# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            return int(o)
        return super(DecimalEncoder, self).default(o)

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('Music')

print("A query with DynamoDB")
response = table.query(
    KeyConditionExpression=Key('Artist').eq('Sam Samuel')
)

for i in response['Items']:
    print(i['SongTitle'], "-", i['Genre'], i['Price'])
The query returns all of the songs by the artist Sam Samuel in the Music table.
You can also perform Scan operations on a table or index. The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index. To have DynamoDB return fewer items, you can provide a FilterExpression operation.
If the total number of scanned items exceeds the maximum dataset size limit of 1 MB, the scan stops, and the results are returned to the user with a LastEvaluatedKey value so that the scan can be continued in a subsequent operation. The results also include the number of items exceeding the limit. A scan can result in no table data meeting the filter criteria.
A single Scan operation reads up to the maximum number of items set (if you’re using the Limit parameter) or a maximum of 1 MB of data and then applies any filtering to the results by using FilterExpression. If LastEvaluatedKey is present in the response, you must paginate the result set.
Scan operations proceed sequentially; however, for faster performance on a large table or secondary index, applications can request a parallel Scan operation by providing the Segment and TotalSegments parameters.
Scan uses eventually consistent reads when accessing the data in a table; therefore, the result set might not include the changes to data in the table immediately before the operation began. If you need a consistent copy of the data, as of the time that the Scan begins, you can set the ConsistentRead parameter to true.
The following is a scan on the Music table with the Python SDK:
# Return all of the data in the table that matches the filter
import boto3
import decimal
import json

from boto3.dynamodb.conditions import Attr

# Create the DynamoDB resource and use the Music table
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('Music')

# Helper class to convert a DynamoDB decimal/item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            return int(o)
        return super(DecimalEncoder, self).default(o)

# Specify some filters for the scan.
# Here we are stating that the Price must be between 12 and 30.
fe = Attr('Price').between(12, 30)
pe = "#g, Price"
# Expression Attribute Names for the Projection Expression only.
ean = {"#g": "Genre"}

response = table.scan(
    FilterExpression=fe,
    ProjectionExpression=pe,
    ExpressionAttributeNames=ean
)

# Print all the items
for i in response['Items']:
    print(json.dumps(i, cls=DecimalEncoder))

# Keep scanning until the entire result set has been paginated through
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=fe,
        ProjectionExpression=pe,
        ExpressionAttributeNames=ean,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))
As you can see from the Python code, the scan returns the genre and price of all records with a price between 12 and 30. The LastEvaluatedKey value is used to continue looping through the entire table.
Global tables build upon the DynamoDB global footprint to provide you with a fully managed, multiregion, and multimaster database that provides fast, local, read-and-write performance for massively scaled, global applications. DynamoDB performs all the necessary tasks to create identical tables in these regions and propagate ongoing data changes to all of them. Figure 4.17 shows an example of how global tables can work with a global application and globally dispersed users.
A global table is a collection of one or more DynamoDB tables, all owned by a single AWS account, identified as replica tables. A replica table (or replica, for short) is a single DynamoDB table that functions as a part of a global table. Each replica stores the same set of data items. Any given global table can have only one replica table per region, and every replica has the same table name and the same primary key schema. Changes made in one replica are recorded in a stream and propagated to other replicas, as shown in Figure 4.18.
If your application requires strongly consistent reads, then it must perform all of its strongly consistent reads and writes in the same region. DynamoDB does not support strongly consistent reads across AWS Regions.
Conflicts can arise if applications update the same item in different regions at about the same time (concurrent updates). To ensure eventual consistency, DynamoDB global tables use a “last writer wins” reconciliation between concurrent updates whereby DynamoDB makes a best effort to determine the last writer. With this conflict resolution mechanism, all replicas agree on the latest update and converge toward a state in which they all have identical data.
To create a DynamoDB global table, perform the following steps:
The AWS Management Console automates these tasks so that you can create a global table quickly and easily.
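For the underlying API, the DynamoDB client exposes a create_global_table operation (for the original, 2017-era version of global tables). As a sketch, the helper below only assembles the request parameters; the table name and region list are illustrative:

```python
def build_global_table_request(table_name, regions):
    """Request parameters for create_global_table. A table with the same
    name, the same key schema, and streams enabled must already exist in
    every listed region before the global table can be created."""
    return {
        "GlobalTableName": table_name,
        "ReplicationGroup": [{"RegionName": r} for r in regions],
    }

# With boto3 you would pass this to
#   client.create_global_table(**build_global_table_request(...))
print(build_global_table_request("Customers", ["us-east-1", "eu-west-1"]))
```

Each entry in ReplicationGroup becomes one replica table in the global table.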
The DynamoDB object persistence model enables you to map client-side classes to DynamoDB tables. The instances of these classes (objects) map to items in a DynamoDB table.
You can use the object persistence programming interface to connect to DynamoDB; perform create, read, update, and delete operations (CRUD); execute queries; and implement optimistic locking with a version number. Figure 4.19 shows an example of using the object persistence model to map client-side objects in DynamoDB.
Support for the object persistence model is available in the Java and .NET SDKs.
DynamoDB Local is the downloadable version of DynamoDB that lets you write and test applications by using the Amazon DynamoDB API without accessing the DynamoDB web service. Instead, the database is self-contained on your computer. When you’re ready to deploy your application in production, you can make a few minor changes to the code so that it uses the DynamoDB web service.
Having this local version helps you save on provisioned throughput, data storage, and data transfer fees. In addition, you don’t need an internet connection while you’re developing your application.
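Pointing the SDK at DynamoDB Local is exactly the kind of minor change mentioned above. A minimal sketch (the endpoint below is DynamoDB Local's default port; the helper function is hypothetical):

```python
LOCAL_ENDPOINT = "http://localhost:8000"  # DynamoDB Local's default port

def dynamodb_kwargs(local=True):
    """Keyword arguments for boto3.resource('dynamodb', ...). Toggling the
    single 'local' flag switches between DynamoDB Local and the web
    service."""
    kwargs = {"region_name": "us-east-1"}
    if local:
        kwargs["endpoint_url"] = LOCAL_ENDPOINT  # only change vs. production
    return kwargs

# During development (with DynamoDB Local running):
#   dynamodb = boto3.resource('dynamodb', **dynamodb_kwargs(local=True))
# In production, drop the endpoint override:
#   dynamodb = boto3.resource('dynamodb', **dynamodb_kwargs(local=False))
print(dynamodb_kwargs())
```

Because the rest of your table, query, and scan code is unchanged, the same application logic runs against both the local database and the web service.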
You can use AWS IAM to grant or restrict access to DynamoDB resources and API actions. For example, you could allow a user to execute the GetItem operation on a Books table. DynamoDB also supports fine-grained access control so that you can control access to individual data items and attributes. For example, you might have a Users table in which each user should have access only to his or her own data; you can accomplish this with fine-grained access control by using a condition inside an IAM policy with the dynamodb:LeadingKeys property.
By using LeadingKeys, you can limit the user so that they can access only the items where the partition key matches the userID. In the following example in the Users table, you want to restrict who can view the profile information to only the user to which the data or profile information belongs:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LeadingKeysExample",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:BatchGetItem",
        "dynamodb:Query",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:accountnumber:table/UserProfiles"
      ],
      "Condition": {
        "ForAllValues:StringEquals": {
          "dynamodb:LeadingKeys": [
            "${www.amazon.com:user_id}"
          ],
          "dynamodb:Attributes": [
            "UserId",
            "FirstName",
            "LastName",
            "Email",
            "Birthday"
          ]
        },
        "StringEqualsIfExists": {
          "dynamodb:Select": "SPECIFIC_ATTRIBUTES"
        }
      }
    }
  ]
}
As you can see in the IAM policy, only the specific user is allowed to access a subset of the total attributes, as defined in the dynamodb:Attributes section of the policy. Furthermore, the dynamodb:Select condition requires the application to provide a list of specific attributes to act upon, preventing the application from requesting all attributes.
You can create on-demand backups and enable point-in-time recovery for your DynamoDB tables.
On-demand backups let you create full backups of your tables, or restore them, at any time. These actions execute with zero impact on table performance or availability and without consuming any provisioned throughput on the table.
Point-in-time recovery helps protect your DynamoDB tables from accidental write or delete operations. For example, suppose that a test script accidentally writes to a production DynamoDB table. With point-in-time recovery, you can restore that table to any point in time during the last 35 days. DynamoDB maintains incremental backups of your table. These operations will not affect performance or latency.
DynamoDB offers fully managed encryption at rest, and it is enabled by default. DynamoDB uses AWS KMS for encrypting the objects at rest. By default, DynamoDB uses the AWS-owned customer master key (CMK); however, you can also specify your own AWS KMS CMK that you have created. For more information on AWS KMS, see Chapter 5, “Encryption on AWS.”
Now that you understand what DynamoDB is and how you can use it to create a scalable database for your application, review some best practices for using DynamoDB.
The partition key portion of a table’s primary key determines the logical partitions in which the table’s data is stored. These logical partitions also affect the underlying physical partitions. As a result, you want to distribute your workload across the partitions as evenly as possible, reducing the number of “hot” partition issues that may arise.
Table 4.5 compares the more common partition key schemas and whether they are good for DynamoDB.
Table 4.5 Amazon DynamoDB Partition Key Recommended Strategies
Partition Key Value | Uniformity |
User ID, where the application has many users | Good |
Status code, where there are only a few possible status codes | Bad |
Item creation date, rounded to the nearest time period (for example, day, hour, or minute) | Bad |
Device ID, where each device accesses data at relatively similar intervals | Good |
Device ID, where even if there are many devices being tracked, one is by far more popular than all the others | Bad |
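The uniformity column in the table above can be seen directly by hashing candidate keys. The sketch below is purely illustrative — DynamoDB’s internal hash function is not published, so MD5 stands in for it here — but it shows why high-cardinality keys such as user IDs spread evenly while low-cardinality keys such as status codes cannot.

```python
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int = 4) -> int:
    """Map a partition key to a logical partition. MD5 is a stand-in
    for DynamoDB's internal (unpublished) hash function."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# High-cardinality keys (user IDs) spread nearly evenly across partitions...
user_counts = Counter(partition_for(f"user-{i}") for i in range(10_000))

# ...while a low-cardinality key (a few status codes) can never use
# more partitions than it has distinct values.
status_counts = Counter(partition_for(s) for s in ["OK", "ERROR", "PENDING"])

print(sorted(user_counts.values()))  # four counts, each near 2,500
print(len(status_counts))            # at most 3 partitions ever receive traffic
```

The skewed case is what produces a “hot” partition: all traffic for the popular key value lands on a single partition regardless of how many partitions the table has.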
The Query operation finds items in a table based on primary key values. You must provide the name of the partition key attribute and the value of that attribute. You can also provide a sort key attribute name and value to refine the search results (for example, all of the forums with this ID in the last seven days). By default, Query returns all of the data attributes for the items with the specified primary keys, and the results are sorted by the sort key in ascending order (which can be reversed). Queries are eventually consistent by default, with an option to request strongly consistent reads if necessary.
The Scan operation returns all of the item attributes by accessing every item in the table. It is for this reason that Query is more efficient than the Scan operation.
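The difference between the two operations can be sketched with a hypothetical in-memory model of a table keyed by (partition key, sort key) — the item data and key names here are invented for illustration:

```python
# Illustrative in-memory model of a DynamoDB table keyed by
# (partition key, sort key); names and data are hypothetical.
table = {
    ("forum-1", "2019-01-01"): {"views": 35},
    ("forum-1", "2019-01-02"): {"views": 28},
    ("forum-2", "2019-01-01"): {"views": 11},
}

def query(pk, sk_prefix=""):
    """Query touches only items under one partition key; results come
    back sorted by sort key (ascending by default)."""
    matches = [(sk, item) for (p, sk), item in table.items()
               if p == pk and sk.startswith(sk_prefix)]
    return [item for _, item in sorted(matches)]

def scan():
    """Scan reads every item in the table, regardless of key."""
    return list(table.values())

print(query("forum-1"))  # two items, sorted by date
print(len(scan()))       # 3 -- every item is read
```

Because `scan()` must touch every item no matter what you are looking for, its cost grows with table size, while `query()` is bounded by the size of one partition’s matching items.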
If you are performing analytics, you may want to use a data warehouse. A data warehouse is a central repository of information that you can analyze to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through BI tools, SQL clients, and other analytics applications.
A data warehouse architecture consists of three tiers. The bottom tier of the architecture is the database server, where data is loaded and stored. The middle tier consists of the analytics engine that is used to access and analyze the data. The top tier is the front-end client that presents results through reporting, analysis, and data mining tools.
A data warehouse works by organizing data into a schema that describes the layout and type of data, such as integer, date, or string. When data is ingested, it is stored in various tables described by the schema. Query tools use the schema to determine which data tables to access and analyze.
Benefits of using a data warehouse include the following:
The data warehousing landscape has changed dramatically in recent years with the emergence of cloud-based services that offer high performance, simple deployment, near-infinite scaling, and easy administration at a fraction of the cost of on-premises solutions.
A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. A database is used to capture and store data, such as recording details of a transaction. Table 4.6 is useful in comparing the characteristics of data warehouses and databases.
Table 4.6 Comparison of Data Warehouse and Database Characteristics
Characteristics | Data Warehouse | Transactional Database |
Suitable Workloads | Analytics, reporting, big data | Transaction processing |
Data Source | Data collected and normalized from many sources | Data captured as-is from a single source, such as a transactional system |
Data Capture | Bulk write operations typically on a predetermined batch schedule | Optimized for continuous write operations as new data is available to maximize transaction throughput |
Data Normalization | Denormalized schemas, such as the star schema or snowflake schema | Highly normalized, static schemas |
Data Storage | Optimized for simplicity of access and high-speed query performance by using columnar storage | Optimized for high-throughput write operations to a single row-oriented physical block |
Data Access | Optimized to minimize I/O and maximize data throughput | High volumes of small read operations |
Unlike a data warehouse, a data lake, as described in Chapter 3, “Hello, Storage,” is a centralized repository for all data, including structured and unstructured. A data warehouse uses a predefined schema that is optimized for analytics. In a data lake, the schema is not defined, enabling additional types of analytics, such as big data analytics, full text search, real-time analytics, and machine learning. Table 4.7 compares the characteristics of a data warehouse and a data lake.
Table 4.7 Comparison of Data Warehouse and Data Lake Characteristics
Characteristics | Data Warehouse | Data Lake |
Data | Relational data from transactional systems, operational databases, and line-of-business applications | Nonrelational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications |
Schema | Designed before the data warehouse implementation (schema-on-write) | Written at the time of analysis (schema-on-read) |
Price/Performance | Fastest query results by using higher-cost storage | Query results getting faster by using low-cost storage |
Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (in other words, raw data) |
Users | Business analysts, data scientists, and data developers | Data scientists, data developers, and business analysts (using curated data) |
Analytics | Batch reporting, BI, and visualizations | Machine learning, predictive analytics, data discovery, and profiling |
A data mart is a data warehouse that serves the needs of a specific team or business unit, such as finance, marketing, or sales. It is smaller, is more focused, and may contain summaries of data that best serve its community of users. Table 4.8 compares the characteristics of a data warehouse and a data mart.
Table 4.8 Comparison of Data Warehouse and Data Mart Characteristics
Characteristics | Data Warehouse | Data Mart |
Scope | Centralized, multiple subject areas integrated together | Decentralized, specific subject area |
Users | Organization-wide | A single community or department |
Data Source | Many sources | A single or a few sources, or a portion of data already collected in a data warehouse |
Size | Large—can be 100s of gigabytes to petabytes | Small, generally up to 10s of gigabytes |
Design | Top-down | Bottom-up |
Data Detail | Complete, detailed data | May hold summarized data |
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data by using standard SQL and your existing BI tools. With Amazon Redshift, you can run complex analytic queries against petabytes of structured data using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. Most results come back in seconds. Amazon Redshift is up to 10 times faster than traditional on-premises data warehouses at 1/10 the cost.
An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. After you provision your cluster, you can upload your dataset and then perform data analysis queries. Each cluster has a leader node and one or more compute nodes, and you have a choice of a hardware platform for your cluster.
Amazon Redshift integrates with various data loading and extract, transform, and load (ETL) tools and BI reporting, data mining, and analytics tools. It is based on open standard PostgreSQL, so most existing SQL client applications will integrate with Amazon Redshift with only minimal changes. For important differences between Amazon Redshift SQL and PostgreSQL, see the Amazon Redshift documentation.
The leader node acts as the SQL endpoint and receives queries from client applications, parses the queries, and develops query execution plans. The leader node then coordinates a parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. Finally, it returns the results to the client applications. The leader node also stores metadata about the cluster. Amazon Redshift communicates with client applications by using open standard PostgreSQL, JDBC, and ODBC drivers.
Compute nodes execute the query execution plan and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications.
A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and allocates the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The node size of the cluster determines the number of slices per node.
Figure 4.20 shows the Amazon Redshift data warehouse architecture, including the client applications, JDBC and ODBC connections, leader node, compute nodes, and node slices.
A cluster contains one or more databases. User data is stored on the compute nodes.
When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. There are two categories for node types. The dense storage (DS) node types are storage-optimized using large magnetic disks and can provide up to 2 PB of storage capacity. The dense compute (DC) node types are compute-optimized. Because they use solid state drive (SSD) storage, they deliver much faster I/O compared to DS node types but provide less storage space at a maximum of 326 TB.
Each database within an Amazon Redshift cluster can support many tables. Like most SQL-based databases, you can create a table using the CREATE TABLE command. This command specifies the name of the table, the columns, and their data types. This command also supports specifying compression encodings, distribution strategy, and sort keys in Amazon Redshift.
Each value that Amazon Redshift stores or retrieves has a data type with a fixed set of associated properties. Data types are declared when tables are created. Additional columns can be added to a table by using the ALTER TABLE command, but you cannot change the data type on an existing column.
Many familiar data types are available, including the following:
Numeric data types
Text data types
Date data types
Logical data type
Amazon Redshift uses data compression as one of the key performance optimizations. When you load data for the first time into an empty table, Amazon Redshift samples your data automatically and selects the best compression scheme for each column. Alternatively, you can specify your preferred compression encoding on a per-column basis as part of the CREATE TABLE command.
When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table’s distribution style. When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
This is one of the primary decisions when you’re creating a table in Amazon Redshift. You can configure the distribution style of a table to give Amazon Redshift hints as to how the data should be partitioned to meet your query patterns. The style you select for your database affects query performance, storage requirements, data loading, and maintenance. By choosing the best distribution strategy for each table, you can balance your data distribution and significantly improve overall system performance.
When creating a table, you can choose among one of the three distribution styles: EVEN, KEY, or ALL.
EVEN distribution Rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column. It is an appropriate choice when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution. EVEN is the default distribution style.
KEY distribution Rows are distributed according to the values in one column. The leader node attempts to place matching values on the same node slice. Use this style when you will be querying heavily against values of a specific column.
ALL distribution A copy of the entire table is distributed to every node. This ensures that every row is collocated for every join in which the table participates. This multiplies the storage required by the number of nodes in the cluster, and it takes much longer to load, update, or insert data into multiple tables. Use this style only for relatively slow-moving tables that are not updated frequently or extensively.
Another important decision to make during table creation is choosing the appropriate sort key. Amazon Redshift stores your data on disk in sorted order according to the sort key, and the query optimizer uses sort order when it determines the optimal query plans. Specify an appropriate sort key for the way that your data will be queried, filtered, or joined.
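Both decisions — distribution style and sort key — are declared in the CREATE TABLE statement. The DDL below is a hedged sketch with invented table and column names, built as Python strings so it can stand alone; the `DISTKEY`, `SORTKEY`, and `DISTSTYLE` clauses shown are standard Amazon Redshift syntax:

```python
# Hypothetical Redshift DDL illustrating the DISTKEY/SORTKEY clauses
# discussed above; table and column names are invented for the example.
create_sales = """
CREATE TABLE sales (
    sale_id     BIGINT        NOT NULL,
    customer_id BIGINT        NOT NULL DISTKEY,  -- KEY distribution column
    sale_date   DATE          NOT NULL,
    amount      DECIMAL(12,2)
)
SORTKEY (sale_date);
""".strip()

# A small, slow-moving lookup table that joins everywhere might use ALL:
create_regions = "CREATE TABLE regions (region_id INT, name VARCHAR(64)) DISTSTYLE ALL;"

print(create_sales)
```

Here `customer_id` would be a sensible DISTKEY only if queries join or aggregate heavily on it; `sale_date` as the SORTKEY suits queries that filter on recent date ranges.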
The following are some general guidelines for choosing the best sort key:
Loading large datasets can take a long time and consume many computing resources. How your data is loaded can also affect query performance. You can reduce these impacts by using COPY commands, bulk inserts, and staging tables when loading data into Amazon Redshift.
The COPY command loads data in parallel from Amazon S3 or other data sources in a more efficient manner than INSERT commands.
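A COPY statement might look like the following sketch — the bucket name, prefix, and IAM role ARN are placeholders, not real resources:

```python
# A COPY statement loading CSV files from Amazon S3 in parallel; the
# bucket, prefix, and IAM role ARN are placeholders for this sketch.
copy_cmd = (
    "COPY sales "
    "FROM 's3://my-example-bucket/sales/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' "
    "FORMAT AS CSV;"
)
print(copy_cmd)
```

Because COPY reads from a prefix rather than a single object, splitting the source data into multiple files lets Amazon Redshift load them in parallel across the cluster’s slices.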
You can query Amazon Redshift tables by using standard SQL commands, such as using SELECT statements, to query and join tables. For complex queries, you are able to analyze the query plan to choose better optimizations for your specific access patterns.
For large clusters supporting many users, you can configure workload management (WLM) to queue and prioritize queries.
Amazon Redshift supports snapshots, similar to Amazon RDS. You can create automated and manual snapshots, which are stored in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.
When you restore from a snapshot, Amazon Redshift creates a new cluster and makes it available before all of the data is loaded so that you can begin querying the new cluster immediately. Amazon Redshift will stream data on demand from the snapshot in response to active queries and load all the remaining data in the background.
Achieving proper durability for a self-managed database requires more effort and attention. Even when using Amazon Elastic Block Store (Amazon EBS) volumes, take snapshots on a frozen file system to ensure consistency. Restoring a database might also require operations beyond restoring a volume from a snapshot and attaching it to an Amazon EC2 instance.
Securing your Amazon Redshift cluster is similar to securing other databases running in the AWS Cloud. To meet your needs, you will use a combination of IAM policies, security groups, and encryption to secure the cluster.
Protecting the data stored in Amazon Redshift is an important aspect of your security design. Amazon Redshift supports encryption of data in transit using SSL-encrypted connections.
You can also enable database encryption for your clusters to help protect data at rest. AWS recommends enabling encryption for clusters that contain sensitive data. You might be required to use encryption depending on the compliance guidelines or regulations that govern your data. Encryption is an optional setting, and it must be configured during the cluster launch. To change encryption on a cluster, you need to create a new cluster and migrate the data.
Amazon Redshift automatically integrates with AWS KMS.
Implement security at every level of your Amazon Redshift architecture, including the infrastructure resources, database schema, data, and network access.
Access to Amazon Redshift resources is controlled at three levels: cluster management, cluster connectivity, and database access. For details on the controls available to help you manage each of these areas, see the Amazon Redshift Cluster Management Guide at https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html.
The following are some best practices for securing your Amazon Redshift deployments:
Amazon Redshift also includes Redshift Spectrum, allowing you to run SQL queries directly against exabytes of unstructured data in Amazon S3. No loading or transformation is required.
You can use many open data formats, including Apache Avro, CSV, Grok, Ion, JSON, Optimized Row Columnar (ORC), Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV. Redshift Spectrum automatically scales query compute capacity based on the data being retrieved, so queries against Amazon S3 run fast, regardless of dataset size.
To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that’s connected to your cluster so that you can execute SQL commands. The cluster and the data files in Amazon S3 must be in the same AWS Region.
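Setting up Redshift Spectrum typically involves creating an external schema backed by a data catalog and then external tables over S3 prefixes. The DDL below is a hedged sketch — the database name, role ARN, bucket, and column names are all placeholders:

```python
# Hypothetical Redshift Spectrum DDL; names, ARNs, and the S3 location
# are placeholders invented for this sketch.
external_schema = (
    "CREATE EXTERNAL SCHEMA spectrum "
    "FROM DATA CATALOG DATABASE 'spectrumdb' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';"
)
external_table = (
    "CREATE EXTERNAL TABLE spectrum.clicks (user_id BIGINT, url VARCHAR) "
    "STORED AS PARQUET "
    "LOCATION 's3://my-example-bucket/clicks/';"
)
print(external_schema)
print(external_table)
```

Once defined, `spectrum.clicks` can be queried and even joined against local Amazon Redshift tables with ordinary SELECT statements, without loading the S3 data into the cluster.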
In-memory data stores are used for caching and real-time workloads. AWS provides a variety of in-memory, key-value database options. You can operate your own nonrelational key-value data store in the cloud on Amazon EC2 and Amazon EBS, work with AWS solution providers, or take advantage of fully managed nonrelational services such as Amazon ElastiCache.
In computing, the data in a cache is generally stored in fast-access hardware, such as random-access memory (RAM), and may also be used in conjunction with a software component. A cache’s primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer.
Trading off capacity for speed, a cache typically stores a subset of data transiently, in contrast to databases whose data is usually complete and durable.
A cache provides high-throughput, low-latency access to commonly accessed application data by storing the data in memory. Caching can improve the speed of your application. Caching reduces the response latency, which improves a user’s experience with your application.
Time-consuming database queries and complex queries often create bottlenecks in applications. In read-intensive applications, caching can provide large performance gains by reducing application processing time and database access time. Write-intensive applications typically do not see as great a benefit to caching. However, even write-intensive applications normally have a read/write ratio greater than 1, which implies that read caching can be beneficial. In summary, the benefits of caching include the following:
The following types of information or applications can often benefit from caching:
Consider caching your data if the following conditions apply:
You can implement different caching strategies for your application. Two primary methods are lazy loading and write through. A cache hit occurs when the cache contains the information requested. A cache miss occurs when the cache does not contain the information requested.
Lazy loading Lazy loading is a caching strategy that loads data into the cache only when necessary. When your application requests data, it first makes the request to the cache. If the data exists in the cache (a cache hit), it is retrieved; but if it does not exist or has expired (a cache miss), then the data is retrieved from your data store and then stored in the cache. The advantage of lazy loading is that only the requested data is cached. The disadvantage is that there is a cache miss penalty resulting in three trips:
Write through The write-through strategy adds data or updates in the cache whenever data is written to the database. The advantage of write through is that the data in the cache is never stale. The disadvantage is that there is a write penalty because every write involves two trips: a write to the cache and a write to the database. Another disadvantage is that because most data is never read in many applications, the data or information that is stored in the cluster is never used. This storage incurs a cost for space and overhead due to the duplicate data. In addition, if your data is updated frequently, the cache may be updating often, causing cache churn.
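Both strategies can be sketched in a few lines of pure Python, with plain dictionaries standing in for the cache and the backing database (the keys, values, and TTL are invented for illustration):

```python
import time

database = {"user:1": "Alice"}   # stand-in for the backing data store
cache = {}                       # stand-in for an in-memory cache
TTL_SECONDS = 300                # expire cached entries after 5 minutes

def get_lazy(key):
    """Lazy loading: populate the cache only on a miss or expiry."""
    entry = cache.get(key)
    if entry and time.time() < entry["expires"]:
        return entry["value"]                       # cache hit: one trip
    value = database.get(key)                       # cache miss: go to the DB
    cache[key] = {"value": value, "expires": time.time() + TTL_SECONDS}
    return value

def put_write_through(key, value):
    """Write through: update the database and the cache together,
    so cached data is never stale (at the cost of a write penalty)."""
    database[key] = value
    cache[key] = {"value": value, "expires": time.time() + TTL_SECONDS}

get_lazy("user:1")                 # miss -> loads from database, fills cache
put_write_through("user:2", "Bob") # both stores updated in one operation
print(cache["user:2"]["value"])    # served without ever reading the DB
```

In practice the two strategies are often combined: write through keeps hot data fresh, while lazy loading with a TTL prevents the cache from filling up with data that is never read.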
An in-memory key-value store is a NoSQL database optimized for read-heavy application workloads (such as social networking, gaming, media sharing, and Q&A portals) or compute-intensive workloads (such as a recommendation engine). In-memory caching improves application performance by storing critical pieces of data in memory for low-latency access. Cached information may include the results of I/O-intensive database queries or the results of computationally intensive calculations.
The strict performance requirements imposed by real-time applications mandate more efficient databases. Traditional databases rely on disk-based storage. A single user action may consist of multiple database calls. As they accumulate, latency increases. However, by accessing data in memory, in-memory data stores provide higher throughput and lower latency. In fact, in-memory data stores can be one to two orders of magnitude faster than disk-based databases.
As a NoSQL data store, an in-memory data store does not share the architectural limitations found in traditional relational databases. NoSQL data stores are built to be scalable. Traditional relational databases use a rigid table-based architecture. Some NoSQL data stores use a key-value store and therefore don’t enforce a structure on the data. This enables scalability and makes it easier to grow, partition, or shard data as data stores grow. When consumed as a cloud-based service, an in-memory data store also provides availability and cost benefits. On-demand access allows organizations to scale their applications as needed in response to demand spikes and at a lower cost than disk-based stores. Using managed cloud services also eliminates the need to administer infrastructure. Database hotspots are reduced, and performance becomes more predictable. Some cloud-based services also offer the benefit of high availability with replicas and support for multiple Availability Zones.
A caching layer helps further drive throughput for read-heavy applications. A caching layer is a high-speed storage layer that stores a subset of data. When a read request is sent, the caching layer checks to determine whether it has the answer. If it doesn’t, the request is sent on to the database. Meeting read requests through the caching layer in this manner is more efficient and delivers higher performance than what can be had from a traditional database alone.
It is also more cost-effective. A single node of in-memory cache can deliver the same read throughput as several database nodes. Instead of provisioning additional instances of your traditional database to accommodate a demand spike, you can drive more throughput by adding one node of distributed cache, replacing several database nodes. The caching layer saves you money because you’re paying for one node instead of multiple database nodes, and you get the added benefit of dramatically faster performance for reads.
Developers need a way to maintain super-low latency, even as they accommodate spikes in demand and while controlling infrastructure and database costs and load.
Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the AWS Cloud. The service improves the performance of web applications by allowing you to retrieve information from fast, managed, in-memory caches instead of relying entirely on slower disk-based databases.
ElastiCache automatically detects and replaces failed nodes, reducing the overhead associated with self-managed infrastructure. It also provides a resilient system that mitigates the risk of overloaded databases, which slow website and application load times.
ElastiCache currently supports two different open-source, in-memory, key-value caching engines: Redis and Memcached. Each engine provides some advantages.
Redis is an increasingly popular open-source, key-value store that supports more advanced data structures, such as sorted sets, hashes, and lists. Unlike Memcached, Redis has disk persistence built in, meaning that you can use it for long-lived data. Redis also supports replication, which can be used to achieve Multi-AZ redundancy, similar to Amazon RDS.
Memcached is a widely adopted in-memory key store. It is historically the gold standard of web caching. ElastiCache is protocol-compliant with Memcached, and it is designed to work with popular tools that you use today with existing Memcached environments. Memcached is also multithreaded, meaning that it makes good use of larger Amazon EC2 instance sizes with multiple cores.
Although both Memcached and Redis appear similar on the surface, in that they are both in-memory key stores, they are quite different in practice. Because of the replication and persistence features of Redis, ElastiCache manages Redis more as a relational database. Redis ElastiCache clusters are managed as stateful entities that include failover, similar to how Amazon RDS manages database failover.
Conversely, because Memcached is designed as a pure caching solution with no persistence, ElastiCache manages Memcached nodes as a pool that can grow and shrink, similar to an Amazon EC2 Auto Scaling group. Individual nodes are expendable, and ElastiCache provides additional capabilities here, such as automatic node replacement and Auto Discovery.
Consider the following requirements when deciding between Memcached and Redis.
Use Memcached if you require one or more of the following:
Use Redis if you require one or more of the following:
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to 10 times the performance improvement—from milliseconds to microseconds—even at millions of requests per second. DAX does all of the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.
With DAX, you can focus on building great applications for your customers without worrying about performance at scale. You do not need to modify application logic, because DAX is compatible with existing DynamoDB API calls. You can enable DAX with a few clicks in the AWS Management Console or by using the AWS SDK, and you pay only for the capacity you provision.
AWS provides a variety of graph database options, such as Amazon Neptune, or you can operate your own graph database in the cloud on Amazon EC2 and Amazon EBS. This section takes a closer look at what exactly is a graph database and when you would want to use one.
Many applications being built today must understand and navigate relationships between highly connected data. This can enable use cases like the following:
Because the data is highly connected, it is easily represented as a graph, which is a data structure that consists of vertices and directed links called edges. Vertices and edges can each have properties associated with them. Figure 4.21 depicts a simple graph of relationships between friends and their interests—or social network—that could be stored and queried by using a graph database. A graph database is optimized to store and process graph data.
AWS provides a variety of graph database options. You can operate your own graph database in the cloud on Amazon EC2 and Amazon EBS. You can also use Neptune, a fully managed graph database service.
Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency.
Neptune is highly available and provides the following features:
Figure 4.22 shows a knowledge graph that can be powered by Neptune.
Neptune supports the popular graph models Property Graph and the W3C’s Resource Description Framework (RDF) and their respective query languages Apache TinkerPop Gremlin and SPARQL. With these models, you can easily build queries that efficiently navigate highly connected datasets. Neptune graph databases include the following use cases:
Data is the cornerstone of successful cloud application deployments. Your evaluation and planning process may highlight the physical limitations inherent to migrating data from on-premises locations into the cloud. Amazon offers a suite of tools to help you move data via networks, roads, and technology partners.
This chapter focuses on the AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT). Customers also use other AWS services and features that are discussed in Chapter 3, “Hello, Storage,” for cloud data migration, including the following:
AWS Database Migration Service (AWS DMS) helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. AWS DMS can migrate your data to and from the most widely used commercial and open-source databases.
The service supports homogeneous database migrations, such as Oracle to Oracle, in addition to heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. You can also stream data to Amazon Redshift, Amazon DynamoDB, and Amazon S3 from any of the supported sources, such as Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, SAP ASE, SQL Server, IBM DB2 LUW, and MongoDB, enabling consolidation and easy analysis of data in a petabyte-scale data warehouse. You can also use AWS DMS for continuous data replication with high availability.
Figure 4.23 shows an example of both heterogeneous and homogeneous database migrations.
To perform a database migration, AWS DMS connects to the source data store, reads the source data, and formats the data for consumption by the target data store. It then loads the data into the target data store. Most of this processing happens in memory, though large transactions might require some buffering to disk. Cached transactions and log files are also written to disk.
At a high level, when you’re using AWS DMS, complete the following tasks:
A task can consist of three major phases:
For heterogeneous database migrations, AWS DMS uses the AWS Schema Conversion Tool (AWS SCT). AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database. Any objects that cannot be automatically converted are clearly marked so that they can be manually converted to complete the migration.
AWS SCT can also scan your application source code for embedded SQL statements and convert them as part of a database schema conversion project. During this process, AWS SCT performs cloud-native code optimization by converting legacy Oracle and SQL Server functions to their equivalent AWS service, thus helping you modernize the applications at the same time as database migration.
Figure 4.24 is a snapshot of the Action Items tab in the AWS SCT report, which shows the items that the tool could not convert automatically. These are the items that you would need to evaluate and adjust manually as needed. The report helps you to determine how much work you would need to do to complete a conversion.
After the schema conversion is complete, AWS SCT can help migrate data from a range of data warehouses to Amazon Redshift by using built-in data migration agents.
Your source database can be on-premises, in Amazon RDS, or in Amazon EC2, and the target database can be in either Amazon RDS or Amazon EC2. AWS SCT supports a number of different heterogeneous conversions. Table 4.9 lists the source and target databases that are supported at the time of this writing.
Table 4.9 Source and Target Databases Supported by AWS SCT
Source Database | Target Database on Amazon RDS |
Oracle Database | Amazon Aurora, MySQL, PostgreSQL, Oracle |
Oracle Data Warehouse | Amazon Redshift |
Azure SQL | Amazon Aurora, MySQL, PostgreSQL |
Microsoft SQL Server | Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL |
Teradata | Amazon Redshift |
IBM Netezza | Amazon Redshift |
IBM DB2 LUW | Amazon Aurora, MySQL, PostgreSQL |
HPE Vertica | Amazon Redshift |
MySQL and MariaDB | PostgreSQL |
PostgreSQL | Amazon Aurora, MySQL |
Amazon Aurora | PostgreSQL |
Greenplum | Amazon Redshift |
Apache Cassandra | Amazon DynamoDB |
This chapter focused heavily on the AWS services that are available from a managed database perspective. However, it is important to know that you can also run your own unmanaged database on Amazon EC2, not only for the exam but for managing projects in the real world. For example, if you want to run MongoDB on Amazon EC2, this is perfectly within the realm of possibility. However, by doing so, you lose the many benefits of using a managed database service.
AWS includes various methods to provide security for your databases and meet the strictest of compliance standards. You can use the following:
You can use AWS Identity and Access Management (IAM) to control who can perform actions on Amazon Aurora MySQL and Amazon RDS for MySQL. Here's an example:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateDBInstanceOnly",
            "Effect": "Allow",
            "Action": [
                "rds:CreateDBInstance"
            ],
            "Resource": [
                "arn:aws:rds:*:123456789012:db:test*",
                "arn:aws:rds:*:123456789012:og:default*",
                "arn:aws:rds:*:123456789012:pg:default*",
                "arn:aws:rds:*:123456789012:subgrp:default"
            ],
            "Condition": {
                "StringEquals": {
                    "rds:DatabaseEngine": "mysql",
                    "rds:DatabaseClass": "db.t2.micro"
                }
            }
        }
    ]
}
In this chapter, you learned the basic concepts of different types of databases, including relational, nonrelational, data warehouse, in-memory, and graph databases. From there, you learned about the various managed database services available on AWS. These included Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon ElastiCache, and Amazon Neptune. You also saw how you can run your own database on Amazon EC2. Finally, you looked at how to perform homogeneous database migrations using the AWS Database Migration Service (AWS DMS). For heterogeneous database migrations, you learned that AWS DMS can use the AWS Schema Conversion Tool (AWS SCT).
Know what a relational database is. A relational database consists of one or more tables. Communication to and from relational databases usually involves simple SQL queries, such as “Add a new record” or “What is the cost of product x?” These simple queries are often referred to as online transaction processing (OLTP).
Know what a nonrelational database is. Nonrelational databases do not have a hard-defined data schema. They can use a variety of models for data management, such as in-memory key-value stores, graph data models, and document stores. These databases are optimized for applications that have a large data volume, require low latency, and have flexible data models. In nonrelational databases, there is no concept of foreign keys.
Understand the database options available on AWS. You can run all types of databases on AWS. You should understand that there are managed and unmanaged options available, in addition to relational, nonrelational, caching, graph, and data warehouses.
Understand which databases Amazon RDS supports. Amazon RDS currently supports six relational database engines:
Understand the operational benefits of using Amazon RDS. Amazon RDS is an AWS managed service. AWS is responsible for patching, antivirus, and the management of the underlying guest OS for Amazon RDS. Amazon RDS greatly simplifies the process of setting up a secondary instance with replication for failover and setting up read replicas to offload queries.
Remember that you cannot access the underlying OS for Amazon RDS DB instances. You cannot use Remote Desktop Protocol (RDP) or SSH to connect to the underlying OS. If you need to access the OS, install custom software or agents, or use a database engine that Amazon RDS does not support, consider running your database on an Amazon EC2 instance instead.
Understand that Amazon RDS handles Multi-AZ failover for you. If your primary Amazon RDS instance becomes unavailable, AWS fails over to your secondary instance in another Availability Zone automatically. This failover is done by pointing your existing database endpoint to a new IP address. You do not have to change the connection string manually; AWS handles the DNS changes automatically.
Remember that Amazon RDS read replicas are used for scaling out and increased performance. This replication feature makes it easy to scale out your read-intensive databases. Read replicas are currently supported in Amazon RDS for MySQL, PostgreSQL, and Amazon Aurora. You can create one or more replicas of a database within a single AWS Region or across multiple AWS Regions. Amazon RDS uses native replication to propagate changes made to a source DB instance to any associated read replicas. Amazon RDS also supports cross-region read replicas to replicate changes asynchronously to another geography or AWS Region.
Know how to calculate throughput for Amazon DynamoDB. Remember that one read capacity unit (RCU) represents one strongly consistent read per second or two eventually consistent reads per second for an item up to 4 KB in size. For writing data, know that one write capacity unit (WCU) represents one write per second for an item up to 1 KB in size. Be comfortable performing calculations to determine the appropriate setting for the RCU and WCU for a table.
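The arithmetic above can be sketched in a few lines of Python (a minimal sketch; the helper names are ours):

```python
import math

def rcus(reads_per_second, item_size_kb, strongly_consistent=True):
    # Each read consumes one RCU per 4 KB (rounded up) when strongly
    # consistent; eventually consistent reads consume half as much.
    per_read = math.ceil(item_size_kb / 4)
    total = reads_per_second * per_read
    return total if strongly_consistent else math.ceil(total / 2)

def wcus(writes_per_second, item_size_kb):
    # Each write consumes one WCU per 1 KB (rounded up).
    return writes_per_second * math.ceil(item_size_kb)

print(rcus(25, 15))         # 25 strongly consistent 15 KB reads -> 100 RCUs
print(rcus(25, 15, False))  # eventually consistent -> 50 RCUs
print(wcus(100, 0.5))       # 100 writes of 512 bytes -> 100 WCUs
```

These are exactly the kinds of calculations you should be able to do quickly on the exam.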
Know that DynamoDB spreads RCUs and WCUs across partitions evenly. Recall that when you allocate your total RCUs and WCUs to a table, DynamoDB spreads these across your partitions evenly. For example, if you have 1,000 RCUs and you have 10 partitions, then you have 100 RCUs allocated to each partition.
Know the differences between a local secondary index and a global secondary index. Remember that you can create local secondary indexes only when you initially create the table; additionally, know that local secondary indexes must share the same partition key as the parent or source table. Conversely, you can create global secondary indexes at any time, with different partition keys or sort keys.
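For example, adding a global secondary index to an existing table is a single UpdateTable call. The parameters below sketch one such index on the Users table; the index name and attribute are hypothetical.

```python
# Hypothetical GSI on the Users table, keyed on last name. Unlike a local
# secondary index, this can be created after the table already exists and
# does not have to share the table's partition key.
gsi_update = {
    'AttributeDefinitions': [
        {'AttributeName': 'user_lname', 'AttributeType': 'S'}
    ],
    'GlobalSecondaryIndexUpdates': [{
        'Create': {
            'IndexName': 'LastNameIndex',
            'KeySchema': [{'AttributeName': 'user_lname', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {'ReadCapacityUnits': 5,
                                      'WriteCapacityUnits': 5}
        }
    }]
}
# With a boto3 DynamoDB client, you would then call:
# dynamodb_client.update_table(TableName='Users', **gsi_update)
```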
Know the difference between eventually consistent and strongly consistent reads. Know that with eventually consistent reads, your application may retrieve data that is stale; but with strongly consistent reads, the data is always up-to-date.
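In DynamoDB, this is a per-request choice: reads are eventually consistent unless you ask otherwise. A minimal sketch (the table layout and helper name are hypothetical, matching the Users table from the exercises):

```python
def get_user(table, user_id, user_email, strongly_consistent=False):
    # GetItem defaults to an eventually consistent read; setting
    # ConsistentRead=True forces a strongly consistent read, which
    # consumes twice the read capacity.
    response = table.get_item(
        Key={'user_id': user_id, 'user_email': user_email},
        ConsistentRead=strongly_consistent)
    return response.get('Item')
```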
Understand the purpose of caching data and which related services are available. Know why caching is important for your database tier and how it helps to improve your application performance. Additionally, understand the differences between the caching methods (lazy loading and write-through) and the corresponding AWS services (Amazon DynamoDB Accelerator (DAX), ElastiCache for Redis, and ElastiCache for Memcached).
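The two strategies differ only in when the cache is written. A toy sketch, with plain dicts standing in for the cache tier and the database tier (all names are ours):

```python
cache = {}
database = {'user:1': 'CJ Smith'}   # stand-in for the database tier

def get_lazy(key):
    # Lazy loading: the cache is populated only on a miss, so the first
    # read is slow but only requested data is cached (and it may go stale).
    if key not in cache:
        cache[key] = database.get(key)
    return cache[key]

def put_write_through(key, value):
    # Write-through: every write updates the database and the cache
    # together, so reads never see stale data (at the cost of extra
    # writes and cache space for data that may never be read).
    database[key] = value
    cache[key] = value
```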
In the following exercises, you will launch two types of databases: the first database is an SQL database on Amazon RDS, and the second is Amazon DynamoDB (NoSQL). For these sets of exercises, you will use the Python 3 SDK. You can download the Python 3 SDK at https://aws.amazon.com/sdk-for-python/.
Before you can create your first Amazon RDS database, you must create a security group so that you can allow traffic from your development server to communicate with the database tier. To do this, you must use an Amazon EC2 client to create the security group. Security groups are a component of the Amazon EC2 service, even though you can use them as part of Amazon RDS to secure your database tier.
To create the security group, run the following script:
# Exercise 4.1
import boto3
import json
import datetime
# Let's create some variables we'll use throughout these exercises in Chapter 4
# NOTE: Here we are using a CIDR range for incoming traffic. We have set it to
# 0.0.0.0/0, which means ANYONE on the internet can access your database if
# they have the username and the password.
# If possible, specify your own CIDR range. You can figure out your CIDR range
# by visiting the following link:
# https://www.google.com/search?q=what+is+my+ip
# In the variable, don't forget to add /32!
# If you aren't sure, leave it open to the world.
# Variables
sg_name = 'rds-sg-dev-demo'
sg_description = 'RDS Security Group for AWS Dev Study Guide'
my_ip_cidr = '0.0.0.0/0'
# Create the EC2 client to create the security group for your database
ec2_client = boto3.client('ec2')
# First we need to create a security group
response = ec2_client.create_security_group(
    Description=sg_description,
    GroupName=sg_name)
print(json.dumps(response, indent=2, sort_keys=True))
# Now add a rule for the security group
response = ec2_client.authorize_security_group_ingress(
    CidrIp=my_ip_cidr,
    FromPort=3306,
    GroupName=sg_name,
    ToPort=3306,
    IpProtocol='tcp'
)
print("Security Group should be created! Verify this in the AWS Console.")
After running the Python code, verify that the security group was created successfully from the AWS Management Console. You can find this confirmation under the VPC or Amazon EC2 service.
Use the Python SDK to spin up your MariaDB database hosted on Amazon RDS.
To spin up the MariaDB database, run the following script and update the Variables section to meet your needs:
# Exercise 4.2
import boto3
import json
import datetime
# Just a quick helper function for date time conversions, in case you want to
# print the raw JSON
def date_time_converter(o):
    if isinstance(o, datetime.datetime):
        return o.__str__()
# Variables
sg_name = 'rds-sg-dev-demo'
rds_identifier = 'my-rds-db'
db_name = 'mytestdb'
user_name = 'masteruser'
user_password = 'mymasterpassw0rd1!'
admin_email = '[email protected]'
sg_id_number = ''
rds_endpoint = ''
# We need to get the Security Group ID Number to use in the creation of the
# RDS Instance
ec2_client = boto3.client('ec2')
response = ec2_client.describe_security_groups(
    GroupNames=[
        sg_name
    ])
sg_id_number = json.dumps(response['SecurityGroups'][0]['GroupId'])
sg_id_number = sg_id_number.replace('"','')
# Create the client for Amazon RDS
rds_client = boto3.client('rds')
# This will create our MariaDB Database
# NOTE: Here we are hardcoding passwords for simplicity and testing purposes
# only! In production you should never hardcode passwords in configuration
# files/code!
# NOTE: This will create a MariaDB Database. Be sure to remove it when you
# are done.
response = rds_client.create_db_instance(
    DBInstanceIdentifier=rds_identifier,
    DBName=db_name,
    DBInstanceClass='db.t2.micro',
    Engine='mariadb',
    MasterUsername=user_name,
    MasterUserPassword=user_password,
    VpcSecurityGroupIds=[
        sg_id_number
    ],
    AllocatedStorage=20,
    Tags=[
        {
            'Key': 'POC-Email',
            'Value': admin_email
        },
        {
            'Key': 'Purpose',
            'Value': 'AWS Developer Study Guide Demo'
        }
    ]
)
# We need to wait until the DB instance is up!
print('Creating the RDS instance. This may take several minutes...')
waiter = rds_client.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier=rds_identifier)
print('Okay! The Amazon RDS Database is up!')
After the script has executed, the following message is displayed:
Creating the RDS instance. This may take several minutes...
After the Amazon RDS database instance has been created successfully, the following confirmation is displayed:
Okay! The Amazon RDS Database is up!
You can also view these messages from the Amazon RDS console.
Before you can start using the Amazon RDS instance, you must first specify your endpoint. In this exercise, you will use the Python SDK to obtain the value.
To obtain the Amazon RDS endpoint, run the following script:
# Exercise 4.3
import boto3
import json
import datetime
# Just a quick helper function for date time conversions, in case you want to print the raw JSON
def date_time_converter(o):
    if isinstance(o, datetime.datetime):
        return o.__str__()
# Variables
rds_identifier = 'my-rds-db'
# Create the client for Amazon RDS
rds_client = boto3.client('rds')
print("Fetching the RDS endpoint...")
response = rds_client.describe_db_instances(
    DBInstanceIdentifier=rds_identifier
)
rds_endpoint = json.dumps(response['DBInstances'][0]['Endpoint']['Address'])
rds_endpoint = rds_endpoint.replace('"','')
print('RDS Endpoint: ' + rds_endpoint)
After running the Python code, the following status is displayed:
Fetching the RDS endpoint...
RDS Endpoint: <endpoint_name>
If the endpoint is not returned, from the AWS Management Console, under the RDS service, verify that your Amazon RDS database instance was created.
You now have all the necessary information to create your first SQL table by using Amazon RDS. In this exercise, you will create a SQL table and add a couple of records. Remember to update the variables for your specific environment.
To update the variables, run the following script:
# Exercise 4.4
import boto3
import json
import datetime
import pymysql as mariadb
# Variables
rds_identifier = 'my-rds-db'
db_name = 'mytestdb'
user_name = 'masteruser'
user_password = 'mymasterpassw0rd1!'
rds_endpoint = 'my-rds-db.****.us-east-1.rds.amazonaws.com'
# Step 1 - Connect to the database to create the table
db_connection = mariadb.connect(host=rds_endpoint, user=user_name,
password=user_password, database=db_name)
cursor = db_connection.cursor()
try:
    cursor.execute("CREATE TABLE Users (user_id INT NOT NULL AUTO_INCREMENT, "
                   "user_fname VARCHAR(100) NOT NULL, user_lname VARCHAR(150) NOT NULL, "
                   "user_email VARCHAR(175) NOT NULL, PRIMARY KEY (`user_id`))")
    print('Table Created!')
except mariadb.Error as e:
    print('Error: {}'.format(e))
finally:
    db_connection.close()
# Step 2 - Connect to the database to add users to the table
db_connection = mariadb.connect(host=rds_endpoint, user=user_name,
password=user_password, database=db_name)
cursor = db_connection.cursor()
try:
    sql = "INSERT INTO `Users` (`user_fname`, `user_lname`, `user_email`) VALUES (%s, %s, %s)"
    cursor.execute(sql, ('CJ', 'Smith', '[email protected]'))
    cursor.execute(sql, ('Casey', 'Smith', '[email protected]'))
    cursor.execute(sql, ('No', 'One', '[email protected]'))
    # No data is saved unless we commit the transaction!
    db_connection.commit()
    print('Inserted Data to Database!')
except mariadb.Error as e:
    print('Error: {}'.format(e))
    print('Sorry, something has gone wrong!')
finally:
    db_connection.close()
After running the Python code, the following confirmation is displayed:
Table Created!
Inserted Data to Database!
Your Amazon RDS database now has some data stored in it.
In this exercise, you are hardcoding a password into your application code for demonstration purposes only. In a production environment, refrain from hard-coding application passwords. Instead, use services such as AWS Secrets Manager to keep your secrets secure.
After adding data to your SQL database, in this exercise you will be able to read or query the items in the Users table.
To read the items in the SQL table, run the following script:
# Exercise 4.5
import boto3
import json
import datetime
import pymysql as mariadb
# Variables
rds_identifier = 'my-rds-db'
db_name = 'mytestdb'
user_name = 'masteruser'
user_password = 'mymasterpassw0rd1!'
rds_endpoint = 'my-rds-db.*****.us-east-1.rds.amazonaws.com'
db_connection = mariadb.connect(host=rds_endpoint, user=user_name, password=user_password, database=db_name)
cursor = db_connection.cursor()
try:
    sql = "SELECT * FROM `Users`"
    cursor.execute(sql)
    query_result = cursor.fetchall()
    print('Querying the Users Table...')
    print(query_result)
except mariadb.Error as e:
    print('Error: {}'.format(e))
    print('Sorry, something has gone wrong!')
finally:
    db_connection.close()
After running the Python code, you will see the three records that you inserted in the previous exercise.
You’ve created an Amazon RDS DB instance and added data to it. In this exercise, you will remove a few resources from your account. Remove the Amazon RDS instance first.
To remove the Amazon RDS instance and the security group, run the following script:
# Exercise 4.6
import boto3
import json
import datetime
# Variables
rds_identifier = 'my-rds-db'
sg_name = 'rds-sg-dev-demo'
sg_id_number = ''
# Create the client for Amazon RDS
rds_client = boto3.client('rds')
# Delete the RDS Instance
response = rds_client.delete_db_instance(
    DBInstanceIdentifier=rds_identifier,
    SkipFinalSnapshot=True)
print('RDS Instance is being terminated...This may take several minutes.')
waiter = rds_client.get_waiter('db_instance_deleted')
waiter.wait(DBInstanceIdentifier=rds_identifier)
# We must wait to remove the security group until the RDS database has been
# deleted; this is a dependency.
print('The Amazon RDS database has been deleted. Removing Security Groups')
# Create the client for Amazon EC2 SG
ec2_client = boto3.client('ec2')
# Get the Security Group ID Number
response = ec2_client.describe_security_groups(
    GroupNames=[
        sg_name
    ])
sg_id_number = json.dumps(response['SecurityGroups'][0]['GroupId'])
sg_id_number = sg_id_number.replace('"','')
# Delete the Security Group!
response = ec2_client.delete_security_group(
    GroupId=sg_id_number
)
print('Cleanup is complete!')
After running the Python code, the following message is displayed:
Cleanup is complete!
The Amazon RDS database and the security group are removed. You can verify this from the AWS Management Console.
Amazon DynamoDB is a managed NoSQL database. One major difference between DynamoDB and Amazon RDS is that DynamoDB doesn’t require a server that is running in your VPC, and you don’t need to specify an instance type. Instead, create a table.
To create the table, run the following script:
# Exercise 4.7
import boto3
import json
import datetime
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.create_table(
    TableName='Users',
    KeySchema=[
        {
            'AttributeName': 'user_id',
            'KeyType': 'HASH'
        },
        {
            'AttributeName': 'user_email',
            'KeyType': 'RANGE'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'user_id',
            'AttributeType': 'S'
        },
        {
            'AttributeName': 'user_email',
            'AttributeType': 'S'
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
print("The DynamoDB Table is being created, this may take a few minutes...")
table.meta.client.get_waiter('table_exists').wait(TableName='Users')
print("Table is ready!")
After running the Python code, the following message is displayed:
Table is ready!
From the AWS Management Console, under DynamoDB, verify that the table was created.
With DynamoDB, there are fewer components to set up than there are for Amazon RDS. In this exercise, you’ll add users to your table. Experiment with updating and changing some of the code to add multiple items to the database.
To add users to the DynamoDB table, run the following script:
# Exercise 4.8
import boto3
import json
import datetime
# In this example we are not using uuid; however, you could use this to autogenerate your user IDs.
# i.e. str(uuid.uuid4())
import uuid
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
# Write a record to DynamoDB
response = table.put_item(
    Item={
        'user_id': '1234-5678',
        'user_email': '[email protected]',
        'user_fname': 'Sam',
        'user_lname': 'Samuels'
    }
)
# Just printing the raw JSON response, you should see a 200 status code
print(json.dumps(response, indent=2, sort_keys=True))
After running the Python code, you receive a 200 HTTP Status Code from AWS. This means that the user record has been added.
From the AWS Management Console, under DynamoDB, review the table to verify that the user record was added.
In this exercise, you look up the one user you’ve added so far.
To look up users in the DynamoDB table, run the following script:
# Exercise 4.9
import boto3
from boto3.dynamodb.conditions import Key
import json
import datetime
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
# Query some data
response = table.query(
    KeyConditionExpression=Key('user_id').eq('1234-5678')
)
# Print the data out!
print(json.dumps(response['Items'], indent=2, sort_keys=True))
After running the Python code, the query results are returned in JSON format showing a single user.
In this exercise, you will write data to the table through a batch process.
To write data using a batch process, run the following script:
# Exercise 4.10
import boto3
import json
import datetime
import uuid
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
# Generate some random data
with table.batch_writer() as user_data:
    for i in range(100):
        user_data.put_item(
            Item={
                'user_id': str(uuid.uuid4()),
                'user_email': 'someone' + str(i) + '@somewhere.com',
                'user_fname': 'User' + str(i),
                'user_lname': 'UserLast' + str(i)
            }
        )
        print('Writing record # ' + str(i+1) + ' to DynamoDB Users Table')
print('Done!')
After running the Python code, the last few lines read as follows:
Writing record # 100 to DynamoDB Users Table
Done!
From the AWS Management Console, under DynamoDB Table, verify that the users were written to the table.
In this exercise, you will scan the entire table.
To scan the table, run the following script:
# Exercise 4.11
import boto3
import json
import datetime
import uuid
# Create a DynamoDB Resource
dynamodb_resource = boto3.resource('dynamodb')
table = dynamodb_resource.Table('Users')
# Let's do a scan!
response = table.scan()
print('The total Count is: ' + json.dumps(response['Count']))
print(json.dumps(response['Items'], indent=2, sort_keys=True))
As you learned in this chapter, scans return the entire dataset located in the table. After running the script, all of the users are returned.
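One caveat worth knowing: a single Scan call returns at most 1 MB of data, so for larger tables you page through the results by passing LastEvaluatedKey back as ExclusiveStartKey. A small helper sketching this (it works with any table object exposing a scan() method):

```python
def scan_all(table):
    # Follow LastEvaluatedKey until the whole table has been read.
    items = []
    response = table.scan()
    items.extend(response['Items'])
    while 'LastEvaluatedKey' in response:
        response = table.scan(
            ExclusiveStartKey=response['LastEvaluatedKey'])
        items.extend(response['Items'])
    return items

# all_users = scan_all(table)
```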
In this exercise, you will remove the DynamoDB table that you created in Exercise 4.7.
To remove the table, run the following script:
# Exercise 4.12
import boto3
import json
import datetime
import uuid
# Create a DynamoDB Resource
dynamodb_client = boto3.client('dynamodb')
# Delete the Table
response = dynamodb_client.delete_table(TableName='Users')
print(json.dumps(response, indent=2, sort_keys=True))
The DynamoDB table is deleted, or it is in the process of being deleted. Verify the deletion from the AWS Management Console, under the DynamoDB service.
Which of the following does Amazon Relational Database Service (Amazon RDS) manage on your behalf? (Select THREE.)
Which AWS database service is best suited for managing highly connected datasets?
You are designing an ecommerce web application that will scale to potentially hundreds of thousands of concurrent users. Which database technology is best suited to hold the session state for large numbers of concurrent users?
How many read capacity units (RCUs) do you need to support 25 strongly consistent reads per second of 15 KB?
How many read capacity units (RCUs) do you need to support 25 eventually consistent reads per second of 15 KB?
How many write capacity units (WCUs) are needed to support 100 writes per second of 512 bytes?
Your company is using Amazon DynamoDB, and they would like to implement a write-through caching mechanism. They would like to get everything up and running in only a few short weeks. Additionally, your company would like to refrain from managing any additional servers. You are the lead developer on the project; what should you recommend?
Your company would like to implement a highly available caching solution for its SQL database running on Amazon RDS. Currently, all of its services are running in the AWS Cloud. As their lead developer, what should you recommend?
A company is looking to run analytical queries and would like to implement a data warehouse. It estimates that it has roughly 300 TB worth of data, which is expected to double in the next three years. Which AWS service should you recommend?
A company is experiencing an issue with Amazon DynamoDB whereby the data is taking longer than expected to return from a query. You are tasked with investigating the problem. After looking at the application code, you realize that a Scan operation is being called for a large DynamoDB table. What should you do or recommend?