Chapter 3
Hello, Storage

THE AWS CERTIFIED DEVELOPER – ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:

  • Domain 2: Security
  • 2.2 Implement encryption using AWS services.
  • 2.3 Implement application authentication and authorization.
  • Domain 3: Development with AWS Services
  • 3.2 Translate functional requirements into application design.
  • 3.3 Implement application design into application code.
  • 3.4 Write code that interacts with AWS services by using APIs, SDKs, and AWS CLI.
  • Domain 4: Refactoring
  • 4.1 Optimize application to best use AWS services and features.
  • Domain 5: Monitoring and Troubleshooting
  • 5.2 Perform root cause analysis on faults found in testing or production.

Introduction to AWS Storage

Cloud storage is a critical component of cloud computing, holding the information used by applications built by developers. In this chapter, we will walk you through the portfolio of storage services that AWS offers and decompose some phrases that you might have heard, such as data lake.

The internet era brought about new challenges for data storage and processing, which prompted the creation of new technologies. The latest generation of data stores are no longer multipurpose, single-box systems. Instead, they are complex, distributed systems optimized for a particular type of task at a particular scale. Because no single data store is ideal for all workloads, choosing a data store for the entire system will no longer serve you well. Instead, you need to consider each individual workload or component within the system and choose a data store that is right for it.

The AWS Cloud is a reliable, scalable, and secure location for your data. Cloud storage is typically more reliable, scalable, and secure than traditional, on-premises storage systems. AWS offers object storage, file storage, block storage, and data transfer services, which we will explore in this chapter. Figure 3.1 shows the storage and data transfer options on AWS.


Figure 3.1 The AWS storage portfolio

This chapter covers how to provision storage using just-in-time purchasing, which helps you avoid overprovisioning and paying for unused storage that you expect to grow into eventually.

Storage Fundamentals

Before we explore the various AWS storage services, let’s review a few storage fundamentals. As a developer, you are likely already familiar with block storage and the differences between hot and cold storage. Cloud storage introduces some new concepts such as object storage, and we will compare these new concepts with the traditional storage concepts with which you are already familiar. If you have been working on the cloud already, these fundamentals are likely a refresher for you.

The goal of this chapter is to build a mental model that allows you, as a developer, to choose and implement the best storage options for your applications.

The AWS storage portfolio mental model starts with the core data building blocks, which include block, file, and object storage. For block storage, AWS has Amazon Elastic Block Store (Amazon EBS). For file storage, AWS has Amazon Elastic File System (Amazon EFS). For object storage, AWS has Amazon Simple Storage Service (Amazon S3) and Amazon S3 Glacier. Figure 3.2 illustrates this set of storage building blocks.

The figure shows a complete set of storage building blocks. A circle labeled “Storage types” is divided into three parts: Block, File and Object, and Archival. On the left-hand side, a box labeled “Data movement” shows seven icons: Hybrid Storage, Streaming Data, File Data, WAN Acceleration, Private Networks, Third-Party Applications, and Physical Appliances. On the right-hand side, a box labeled “Data Security and Management” shows seven icons: Data Discovery and Protection, Data Visualization, Serverless Computing, Automation, Audit Trails, Monitoring and Metrics, and Access Controls and Encryption.

Figure 3.2 A complete set of storage building blocks

Data Dimensions

When investigating which storage options to use for your applications, consider the different dimensions of your data first. In other words, find the right tool for your data instead of squeezing your data into a tool that might not be the best fit.

So, before you start considering storage options, take time to evaluate your data and decide under which of these dimensions your data falls. This will help you make the correct decisions about what type of storage is best for your data.

Symbol of Tip Think in terms of a data storage mechanism that is most suitable for a particular workload—not a single data store for the entire system. Choose the right tool for the job.

Velocity, Variety, and Volume

The first dimension to consider comprises the three Vs of big data: velocity, variety, and volume. These concepts are applicable to more than big data. It is important to identify these traits for any data that you are using in your applications.

Velocity Velocity is the speed at which data is being read or written, measured in reads per second (RPS) or writes per second (WPS). The velocity can be based on batch processing, periodic, near-real-time, or real-time speeds.

Variety Variety determines how structured the data is and how many different structures exist in the data. This can range from highly structured to loosely structured, unstructured, or binary large object (BLOB) data.

Highly structured data has a predefined schema, such as data stored in relational databases, which we will discuss in Chapter 4, “Hello, Databases.” In highly structured data, each entity of the same type has the same number and type of attributes, and the domain of allowed values for an attribute can be further constrained. The advantage of highly structured data is its self-describing nature.

Loosely structured data consists of entities that have attributes or fields. Aside from the attribute that uniquely identifies an entity, the attributes are not required to be the same in every entity. This data is more difficult to analyze and process in an automated fashion, which puts more of the burden of reasoning about the data on the consumer or application.

Unstructured data has no predefined structure: no entities and no attributes. It can still contain useful information, but that information must be extracted by the consumer of the data.

BLOB data is useful as a whole, but there is often little benefit in trying to extract value from a piece or attribute of a BLOB. Therefore, the systems that store this data typically treat it as a black box and only need to be able to store and retrieve a BLOB as a whole.

Volume Volume is the total size of the dataset. There are two main uses for data: developing valuable insight and storage for later use. When getting valuable insights from data, having more data is often preferable to using better models. When keeping data for later use, be it for digital assets or backups, the more data that you can store, the less you need to guess what data to keep and what to throw away. These two uses prompt you to collect as much data as you can store, process, and afford to keep.

Typical metrics that measure the ability of a data store to support volume are maximum storage capacity and cost (such as $/GB).

Storage Temperature

Data temperature is another useful way of looking at data to determine the right storage for your application. It helps us understand how “lively” the data is: how much is being written or read and how soon it needs to be available.

Hot Hot data is being worked on actively; that is, new ingests, updates, and transformations are actively contributing to it. Both reads and writes tend to be single-item. Items tend to be small (up to hundreds of kilobytes). Speed of access is essential. Hot data tends to be high-velocity and low-volume.

Warm Warm data is still being actively accessed, but less frequently than hot data. Often, items can be as small as in hot workloads but are updated and read in sets. Speed of access, while important, is not as crucial as with hot data. Warm data is more balanced across the velocity and volume dimensions.

Cold Cold data still needs to be accessed occasionally, but updates to this data are rare, so reads can tolerate higher latency. Items tend to be large (tens or hundreds of megabytes, or even gigabytes). Items are often written and read individually. High durability and low cost are essential. Cold data tends to be high-volume and low-velocity.

Frozen Frozen data needs to be preserved for business continuity or for archival or regulatory reasons, but it is not being worked on actively. While new data is regularly added to this data store, existing data is never updated. Reads are extremely infrequent (known as “write once, read never”) and can tolerate very high latency. Frozen data tends to be extremely high-volume and extremely low-velocity.

The same data can start as hot and gradually cool down. As it does, the tolerance of read latency increases, as does the total size of the dataset. Later in this chapter, we explore individual AWS services and discuss which services are optimized for the dimensions that we have discussed so far.

Data Value

Although we would like to extract useful information from all of the data we collect, not all data is equally important to us. Some data has to be preserved at all costs, and other data can be easily regenerated as needed or even lost without significant impact on the business. Depending on the value of data, we are more or less willing to invest in additional durability.

Symbol of Tip To optimize cost and/or performance further, segment data within each workload by value and temperature, and consider different data storage options for different segments.

Transient data Transient data is often short-lived. The loss of some subset of transient data does not have significant impact on the system as a whole. Examples include clickstream or Twitter data. We often do not need high durability of this data, because we expect it to be quickly consumed and transformed further, yielding higher-value data. If we lose a tweet or a few clicks, this is unlikely to affect our sentiment analysis or user behavior analysis.

Not all streaming data is transient, however. For an intrusion detection system (IDS), for example, every record representing network communication can be valuable, just as every log record can be valuable to a monitoring/alarming system.

Reproducible data Reproducible data contains a copy of useful information that is often created to improve performance or simplify consumption, such as adding more structure or altering a structure to match consumption patterns. Although the loss of some or all of this data may affect a system’s performance or availability, this will not result in data loss, because the data can be reproduced from other data sources.

Examples include data warehouse data, read replicas of OLTP (online transaction processing) systems, and many types of caches. For this data, we may invest a bit in durability to reduce the impact on the system’s performance and availability, but only to a point.

Authoritative data Authoritative data is the source of truth. Losing this data will have significant business impact because it will be difficult, or even impossible, to restore or replace it. For this data, we are willing to invest in additional durability. The greater the value of this data, the more durability we will want.

Critical/Regulated data Critical or regulated data is data that a business must retain at almost any cost. This data tends to be stored for long periods of time and needs to be protected from accidental and malicious changes—not just data loss or corruption. Therefore, in addition to durability, cost and security are equally important factors.

One Tool Does Not Fit All

Despite the many applications of a hammer, it cannot replace a screwdriver or a pair of pliers. Likewise, there is no one-size-fits-all solution for data storage. Analyze your data and understand the dimensions that we have discussed. Once you have done that, then you can move on to reviewing the different storage options available on AWS to find the right tool to store and access your files.

Symbol of Tip For the exam, know the availability, level of durability, and cost factors for each storage option and how they compare.

Block, Object, and File Storage

There are three types of cloud storage: object, file, and block. Each offers its own unique advantages.

Block Storage

Some enterprise applications, like databases or enterprise resource planning systems (ERP systems), can require dedicated, low-latency storage for each host. This is analogous to direct-attached storage (DAS) or a storage area network (SAN). Block-based cloud storage solutions like Amazon EBS are provisioned with each Amazon Elastic Compute Cloud (Amazon EC2) instance and offer the ultra-low latency required for high-performance workloads.

Object Storage

Applications developed on the cloud often take advantage of object storage’s vast scalability and metadata characteristics. Object storage solutions like Amazon S3 are ideal for building modern applications from scratch that require scale and flexibility and can also be used to import existing data stores for analytics, backup, or archive.

Cloud object storage makes it possible to store virtually limitless amounts of data in its native format.

File Storage

Many applications need to access shared files and require a file system. This type of storage is often supported with a network-attached storage (NAS) server. File storage solutions like Amazon EFS are ideal for use cases such as large content repositories, development environments, media stores, or user home directories.

AWS Shared Responsibility Model and Storage

The AWS shared responsibility model is important to understand as it relates to cloud storage. AWS is responsible for securing the storage services. As a developer and customer, you are responsible for securing access to and using encryption on the artifacts you create or objects you store.

AWS makes this model simpler for you by allowing you to inherit certain compliance factors and controls, but you must still ensure that you are securing your data and files on the cloud. It is a best practice always to use the principle of least privilege as part of your responsibility for using AWS Cloud storage. For example, ensure that only those who need access to the file have access and ensure that read and write access are separated and controlled.

Confidentiality, Integrity, Availability Model

The confidentiality, integrity, availability model (CIA model) forms the fundamentals of information security, and you can apply the principles of the CIA model to AWS storage.

Confidentiality can be equated to the privacy level of your data. It refers to levels of encryption or access policies for your storage or individual files. With this principle, you will limit access to prevent accidental information disclosure by restricting permissions and enabling encryption.

Integrity refers to whether your data is trustworthy and accurate. For example, can you trust that the file you generated has not been changed when it is audited later?

Symbol of Tip Restrict permission of who can modify data and enable backup and versioning.

Availability refers to the availability of a service on AWS for storage, where an authorized party can gain reliable access to the resource.

Symbol of Tip Restrict permission of who can delete data, enable multi-factor authentication (MFA) for Amazon S3 delete operation, and enable backup and versioning.

Figure 3.3 shows the CIA model.

The figure shows the CIA model in a triangular pattern. A small inverted triangle labeled “Information Security” is inscribed in a larger triangle, which is divided into three parts labeled “AVAILABILITY,” “INTEGRITY,” and “CONFIDENTIALITY.”

Figure 3.3 The CIA model

AWS storage services provide many features for maintaining the desired level of confidentiality, integrity, and availability. Each of these features is discussed under its corresponding storage-option section in this chapter.

AWS Block Storage Services

Let’s begin with the storage to which you are most likely already accustomed as a developer; that is, block storage.

Amazon Elastic Block Store

Amazon EBS presents your data to your Amazon EC2 instance as a disk volume, providing the lowest-latency access to your data from single Amazon EC2 instances.

Amazon EBS provides durable and persistent block storage volumes for use with Amazon EC2 instances. Each Amazon EBS volume is automatically replicated within its Availability Zone to protect your information from component failure, offering high availability and durability. Amazon EBS volumes offer the consistent and low-latency performance needed to run your workloads. With Amazon EBS, you can scale your usage up or down within minutes, while paying only for what you provision.

Typical use cases for Amazon EBS include the following:

  • Boot volumes on Amazon EC2 instances
  • Relational and NoSQL databases
  • Stream and log processing applications
  • Data warehousing applications
  • Big data analytics engines (like the Hadoop/HDFS (Hadoop Distributed File System) ecosystem and Amazon EMR clusters)

Amazon EBS is designed to achieve the following:

  • Availability of 99.999 percent
  • Durability of replication within a single availability zone
  • Annual failure rate (AFR) of between 0.1 and 0.2 percent

Amazon EBS volumes are 20 times more reliable than typical commodity disk drives, which fail with an AFR of around 4 percent.

Amazon EBS Volumes

Amazon EBS volumes persist independently from the running life of an Amazon EC2 instance. After a volume is attached to an instance, use it like any other physical hard drive.

Amazon EBS volumes are flexible. For current-generation volumes attached to current-generation instance types, you can dynamically increase size, modify provisioned input/output operations per second (IOPS) capacity, and change the volume type on live production volumes without service interruptions.

Amazon EBS provides the following volume types, which differ in performance characteristics and price so that you can tailor your storage performance and cost to the needs of your applications.

SSD-backed volumes Solid-state drive (SSD)–backed volumes are optimized for transactional workloads involving frequent read/write operations with small I/O size, where the dominant performance attribute is IOPS.

HDD-backed volumes Hard disk drive (HDD)–backed volumes are optimized for large streaming workloads where throughput (measured in MiB/s) is a better performance measure than IOPS.

SSD vs. HDD Comparison

Table 3.1 shows a comparison of Amazon EBS HDD-backed and SSD-backed volumes.

Table 3.1 Volume Comparison

                        SSD                                      HDD
                        General Purpose    Provisioned IOPS      Throughput-Optimized    Cold
Max volume size         16 TiB             16 TiB                16 TiB                  16 TiB
Max IOPS/volume         10,000             32,000                500                     250
Max throughput/volume   160 MiB/s          500 MiB/s             500 MiB/s               250 MiB/s

Table 3.2 shows the most common use cases for the different types of Amazon EBS volumes.

Table 3.2 EBS Volume Use Cases

General Purpose SSD
  • Recommended for most workloads
  • System boot volumes
  • Virtual desktops
  • Low-latency interactive apps
  • Development and test environments

Provisioned IOPS SSD
  • I/O-intensive workloads
  • Relational databases
  • NoSQL databases

Throughput-Optimized HDD
  • Streaming workloads requiring consistent, fast throughput at a low price
  • Big data
  • Data warehouses
  • Log processing
  • Cannot be a boot volume

Cold HDD
  • Throughput-oriented storage for large volumes of data that is infrequently accessed
  • Scenarios where the lowest storage cost is important
  • Cannot be a boot volume

Elastic Volumes

Elastic Volumes is a feature of Amazon EBS that allows you to increase capacity dynamically, tune performance, and change the type of volume live. This can be done with no downtime or performance impact and with no changes to your application. Create a volume with the capacity and performance needed when you are ready to deploy your application, knowing that you have the ability to modify your volume configuration in the future and saving hours of planning cycles and preventing overprovisioning.
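
The following is a minimal sketch of modifying a volume in place with the AWS SDK for Python (Boto3); the volume ID, size, and volume type shown are placeholder values.

import boto3

ec2 = boto3.client('ec2')

# Increase the size and set the type of an existing volume without detaching it.
ec2.modify_volume(
    VolumeId='vol-0123456789abcdef0',  # placeholder volume ID
    Size=200,                          # new size in GiB
    VolumeType='gp2'
)

# Optionally track the progress of the modification.
response = ec2.describe_volumes_modifications(
    VolumeIds=['vol-0123456789abcdef0']
)
print(response['VolumesModifications'][0]['ModificationState'])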

Amazon EBS Snapshots

You can protect your data by creating point-in-time snapshots of Amazon EBS volumes, which are backed up to Amazon S3 for long-term durability. The volume does not need to be attached to a running instance to take a snapshot.

As you continue to write data to a volume, periodically create a snapshot of the volume to use as a baseline for new volumes. These snapshots can be used to create multiple new Amazon EBS volumes or move volumes across Availability Zones.

When you create a new volume from a snapshot, it is an exact copy of the original volume at the time the snapshot was taken.
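
As a sketch of this workflow in Boto3 (the volume ID and Availability Zone are placeholders), you might create a snapshot and then restore it to a new volume as follows:

import boto3

ec2 = boto3.client('ec2')

# Create a point-in-time snapshot of an existing volume.
snapshot = ec2.create_snapshot(
    VolumeId='vol-0123456789abcdef0',           # placeholder volume ID
    Description='Nightly backup of data volume'
)

# Wait until the snapshot completes before using it.
ec2.get_waiter('snapshot_completed').wait(SnapshotIds=[snapshot['SnapshotId']])

# Create a new volume from the snapshot, possibly in a different Availability Zone.
new_volume = ec2.create_volume(
    SnapshotId=snapshot['SnapshotId'],
    AvailabilityZone='us-east-1b'               # placeholder Availability Zone
)
print(new_volume['VolumeId'])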

If you are taking snapshots at regular intervals, such as once per day, you may be concerned about the cost of the storage. Snapshots are incremental backups, meaning that only the blocks on the volume that have changed after your most recent snapshot are saved, making this a cost-effective way to back up your block data. For example, if you have a volume with 100 GiB of data, but only 5 GiB of data have changed since your last snapshot, only the 5 GiB of modified data is written to Amazon S3.

If you need to delete a snapshot, how do you know which snapshot to delete? Amazon EBS handles this for you. Even though snapshots are saved incrementally, the snapshot deletion process is designed so that you need to retain only the most recent snapshot to restore the volume. Amazon EBS will determine which dependent snapshots can be deleted to ensure that all other snapshots continue working.

Amazon EBS Optimization

Recall that Amazon EBS volumes are network-attached and not directly attached to the host like instance stores. On instances without support for Amazon EBS–optimized throughput, network traffic can contend with traffic between your instance and your Amazon EBS volumes. On Amazon EBS–optimized instances, the two types of traffic are kept separate. Some instance configurations incur an extra cost for using Amazon EBS–optimized, while others are always Amazon EBS–optimized at no extra cost.

Amazon EBS Encryption

For simplified data encryption, create encrypted Amazon EBS volumes with the Amazon EBS encryption feature. All Amazon EBS volume types support encryption, and you can use encrypted Amazon EBS volumes to meet a wide range of data-at-rest encryption requirements for regulated/audited data and applications.

Amazon EBS encryption uses 256-bit Advanced Encryption Standard (AES-256) algorithms and an Amazon-managed key infrastructure called AWS Key Management Service (AWS KMS). The encryption occurs on the server that hosts the Amazon EC2 instance, providing encryption of data in transit from the Amazon EC2 instance to Amazon EBS storage.

You can encrypt using an AWS KMS–generated key, or you can choose to select a customer master key (CMK) that you create separately using AWS KMS.

You can also encrypt your files prior to placing them on the volume. Snapshots of encrypted Amazon EBS volumes are automatically encrypted. Amazon EBS volumes that are restored from encrypted snapshots are also automatically encrypted.
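
A minimal Boto3 sketch of creating an encrypted volume follows; the Availability Zone and key alias are placeholders, and the KmsKeyId line is optional (if omitted, the default AWS-managed key for Amazon EBS is used).

import boto3

ec2 = boto3.client('ec2')

# Create an encrypted volume. Omit KmsKeyId to use the default AWS-managed key.
volume = ec2.create_volume(
    AvailabilityZone='us-east-1a',      # placeholder Availability Zone
    Size=100,                           # size in GiB
    VolumeType='gp2',
    Encrypted=True,
    KmsKeyId='alias/my-ebs-cmk'         # placeholder CMK alias
)
print(volume['VolumeId'])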

Amazon EBS Performance

To achieve optimal performance from your Amazon EBS volumes in a variety of scenarios, use the following best practices:

Use Amazon EBS-optimized instances The dedicated network throughput that you get when you request Amazon EBS–optimized support will make volume performance more predictable and consistent, and your Amazon EBS volume network traffic will not have to contend with your other instance traffic because they are kept separate.

Understand how performance is calculated When you measure the performance of your Amazon EBS volumes, it is important to understand the units of measure involved and how performance is calculated.

Understand your workload There is a relationship between the maximum performance of your Amazon EBS volumes, the size and number of I/O operations, and the time it takes for each action to complete. Each of these factors affects the others, and different applications are more sensitive to one factor or another.

On a given volume configuration, certain I/O characteristics drive the performance behavior of your Amazon EBS volumes. SSD-backed volumes (General Purpose SSD and Provisioned IOPS SSD) deliver consistent performance whether an I/O operation is random or sequential. HDD-backed volumes (Throughput-Optimized HDD and Cold HDD) deliver optimal performance only when I/O operations are large and sequential.

To understand how SSD and HDD volumes will perform in your application, it is important to understand the connection between demand on the volume, the quantity of IOPS available to it, the time it takes for an I/O operation to complete, and the volume’s throughput limits.

Be aware of the performance penalty when initializing volumes from snapshots New Amazon EBS volumes receive their maximum performance the moment that they are available and do not require initialization (formerly known as pre-warming).

Storage blocks on volumes that were restored from snapshots, however, must be initialized (pulled down from Amazon S3 and written to the volume) before you can access the block. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. Performance is restored after the data is accessed once.

For most applications, amortizing this cost over the lifetime of the volume is acceptable. For some applications, however, this performance hit is not acceptable. If that is the case, avoid a performance hit by accessing each block prior to putting the volume into production. This process is called initialization.

Factors that can degrade HDD performance When you create a snapshot of a Throughput-Optimized HDD or Cold HDD volume, performance may drop as far as the volume’s baseline value while the snapshot is in progress. This behavior is specific only to these volume types.

Other factors that can limit performance include the following:

  • Driving more throughput than the instance can support
  • The performance penalty encountered when initializing volumes restored from a snapshot
  • Excessive amounts of small, random I/O on the volume

Increase read-ahead for high-throughput, read-heavy workloads Some workloads are read-heavy and access the block device through the operating system page cache (for example, from a file system). In this case, to achieve the maximum throughput, we recommend that you configure the read-ahead setting to 1 MiB. This is a per-block-device setting that should be applied only to your HDD volumes.

Use RAID 0 to maximize utilization of instance resources Some instance types can drive more I/O throughput than what you can provision for a single Amazon EBS volume. You can join multiple volumes of certain instance types together in a RAID 0 configuration to use the available bandwidth for these instances.

Track performance with Amazon CloudWatch Amazon CloudWatch, a monitoring and management service, provides performance metrics and status checks for your Amazon EBS volumes.

Amazon EBS Troubleshooting

If you are using an Amazon EBS volume as a boot volume and your instance is no longer accessible, you cannot use SSH or Remote Desktop Protocol (RDP) to reach that boot volume. There are some steps that you can take, however, to access the volume.

If you have an Amazon EC2 instance based on an Amazon Machine Image (AMI), you may just choose to terminate the instance and create a new one.

If you do need access to that Amazon EBS boot volume, perform the following steps to make it accessible (a sketch of the detach and attach steps using the AWS SDK for Python follows the list):

  1. Create a new Amazon EC2 instance with its own boot volume (a micro instance is great for this purpose).
  2. Detach the root Amazon EBS volume from the troubled instance.
  3. Attach the root Amazon EBS volume from the troubled instance to your new Amazon EC2 instance as a secondary volume.
  4. Connect to the new Amazon EC2 instance, and access the files on the secondary volume.
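
Here is that sketch; the volume ID, instance IDs, and device name are placeholders, and the troubled instance is assumed to be stopped before its root volume is detached.

import boto3

ec2 = boto3.client('ec2')

# Step 2: Detach the root volume from the troubled (stopped) instance.
ec2.detach_volume(
    VolumeId='vol-0123456789abcdef0',       # placeholder volume ID
    InstanceId='i-0aaaaaaaaaaaaaaaa'        # placeholder troubled instance
)
ec2.get_waiter('volume_available').wait(VolumeIds=['vol-0123456789abcdef0'])

# Step 3: Attach the volume to the new instance as a secondary volume.
ec2.attach_volume(
    VolumeId='vol-0123456789abcdef0',
    InstanceId='i-0bbbbbbbbbbbbbbbb',       # placeholder rescue instance
    Device='/dev/sdf'
)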

Instance Store

Amazon EC2 instance store is another type of block storage available to your Amazon EC2 instances. It provides temporary block-level storage, and the storage is located on disks that are physically attached to the host computer (unlike Amazon EBS volumes, which are network-attached).

Symbol of Warning If your data does not need to be resilient to reboots, restarts, or auto recovery, then your data may be a candidate for using instance store, but you should exercise caution.

Instance Store Volumes

Instance store should not be used for persistent storage needs. It is a type of ephemeral (short-lived) storage that does not persist if the instance fails or is terminated.

Because instance store is on the host of your Amazon EC2 instance, it will provide the lowest-latency storage to your instance other than RAM. Instance store volumes can be used when incurring large amounts of I/O for your application at the lowest possible latency. You need to ensure that you have another source of truth of your data, however, and that the only copy is not placed on instance store. For data that needs to be durable, we recommend using Amazon EBS volumes instead.

Not all instance types come with available instance store volume(s), and the size and type of volumes vary by instance type. When you launch an instance whose type includes instance store, that storage is available at no additional cost. However, you must add these volumes when you launch the Amazon EC2 instance; you cannot attach instance store volumes to an instance after it has been launched.

After you launch an instance, the instance store volumes are available to the instance, but you cannot access them until they are mounted. Refer to the AWS documentation to learn more about mounting these volumes on different operating systems.

Many customers use a combination of instance store and Amazon EBS volumes with their instances. For example, you may choose to place your scratch data, tempdb, or other temporary files on instance store while your root volume is on Amazon EBS.

Symbol of Warning Do not use instance store for any production data.

Instance Store–Backed Amazon EC2 Instances

With Amazon EC2, you can use both instance store–backed storage volumes and Amazon EBS–backed storage volumes with your instances, which means that an instance can boot from instance store. If you do this, however, you should work from an AMI so that new instances are created automatically if one fails. This approach is not recommended for primary instances whose failure would affect users. For systems designed to be resilient to instances relaunching, booting from instance store can save money on storage costs compared with using Amazon EBS as your boot volume. It is critical to understand your application and infrastructure needs before choosing instance store–backed Amazon EC2 instances, so choose carefully.

Instance store–backed Amazon EC2 instances cannot be stopped and cannot take advantage of the auto recovery feature for Amazon EC2 instances.

Some AWS customers build instances on the fly that are completely resilient to reboot, relaunch, or failure and use instance store as their root volumes. This requires important due diligence regarding your application and infrastructure to ensure that this type of scenario would be right for you.

AWS Object Storage Services

Now we are going to dive into object storage. An object is a piece of data like a document, image, or video that is stored with some metadata in a flat structure. Object storage provides that data to applications via application programming interfaces (APIs) over the internet.

Amazon Simple Storage Service

With Amazon S3, it is not difficult to build a web application that delivers content to users by retrieving data through API calls over the internet. Amazon Simple Storage Service (Amazon S3) is storage for the internet. It is a simple storage service that offers software developers a highly scalable, reliable, and low-latency data storage infrastructure at low cost. AWS has seen enormous growth with Amazon S3, and AWS currently has customers who store terabytes and exabytes of data.

Symbol of Tip Amazon S3 is featured in many AWS certifications because it is a core enabling service for many applications and use cases.

To begin developing with Amazon S3, it is important to understand a few basic concepts.

Buckets

A bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket. You can think of a bucket in traditional terminology similar to a drive or volume.

Limitations

The following are limitations of which you should be aware when using Amazon S3 buckets:

  • Do not use buckets as folders, because there is a maximum limit of 100 buckets per account.
  • You cannot create a bucket within another bucket.
  • A bucket is owned by the AWS account that created it, and bucket ownership is not transferable.
  • A bucket must be empty before you can delete it.
  • After a bucket is deleted, its name becomes available for reuse. However, someone else may claim the name before you can reuse it, so if you expect to use the same bucket name again, do not delete the bucket.

Symbol of Tip You can only create up to 100 buckets per account. Do not use buckets as folders or design your application in a way that could result in more than 100 buckets as your application or data grows.

Universal Namespace

A bucket name must be unique across all existing bucket names in Amazon S3 across all of AWS—not just within your account or within your chosen AWS Region. You must comply with Domain Name System (DNS) naming conventions when choosing a bucket name.

The rules for DNS-compliant bucket names are as follows:

  • Bucket names must be at least 3 and no more than 63 characters long.
  • A bucket name must consist of a series of one or more labels, with adjacent labels separated by a single period (.).
  • Each label can contain only lowercase letters, numbers, and hyphens.
  • Each label must start and end with a lowercase letter or number.
  • Bucket names must not be formatted like IP addresses (for example, 192.168.5.4).
  • AWS recommends that you do not use periods (.) in bucket names. When using virtual hosted-style buckets with Secure Sockets Layer (SSL), the SSL wildcard certificate only matches buckets that do not contain periods. To work around this, use HTTP or write your own certificate verification logic.

Symbol of Tip Amazon S3 bucket names must be universally unique.

Table 3.3 shows examples of invalid bucket names.

Table 3.3 Invalid Bucket Names

Bucket Name Reason
.myawsbucket The bucket name cannot start with a period (.).
myawsbucket. The bucket name cannot end with a period (.).
my..examplebucket There can be only one period between labels.

The following code snippet is an example of creating a bucket using Java:

private static String bucketName = "*** bucket name ***";

public static void main(String[] args) throws IOException {
    AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
    s3client.setRegion(Region.getRegion(Regions.US_WEST_1));

    if (!(s3client.doesBucketExist(bucketName))) {
        // Note that CreateBucketRequest does not specify a region, so the bucket
        // is created in the region configured on the client.
        s3client.createBucket(new CreateBucketRequest(bucketName));
    }

    // Get the bucket's location.
    String bucketLocation = s3client.getBucketLocation(new GetBucketLocationRequest(bucketName));
    System.out.println("bucket location = " + bucketLocation);
}

Versioning

Versioning is a means of keeping multiple variants of an object in the same bucket. You can use versioning to preserve, retrieve, and restore every version of every object stored in your Amazon S3 bucket, including recovering deleted objects. With versioning, you can easily recover from both unintended user actions and application failures.

There are several reasons that developers will turn on versioning of files in Amazon S3, including the following:

  • Protecting from accidental deletion
  • Recovering an earlier version
  • Retrieving deleted objects

Versioning is turned off by default. When you turn on versioning, Amazon S3 creates a new version of an object every time you overwrite or update an object with the same key, while preserving the previous versions.
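
A minimal Boto3 sketch of enabling versioning on an existing bucket follows (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')

# Enable versioning on the bucket. To suspend it later, set Status to 'Suspended'.
s3.put_bucket_versioning(
    Bucket='my-bucket',
    VersioningConfiguration={'Status': 'Enabled'}
)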

In Figure 3.4, you can see that we have uploaded the same image multiple times, and all of the previous versions of those files have been maintained.


Figure 3.4 Amazon S3 versioning

As those additional writes apply to a bucket, you can retrieve any of the particular objects that you need using GET on the object key name and the particular version. Amazon S3 versioning tracks the changes over time.

Amazon S3 versioning also protects against unintended deletes. If you issue a delete command against an object in a versioned bucket, AWS places a delete marker on top of that object, which means that if you perform a GET on it, you will receive an error as if the object does not exist. However, an administrator, or anyone else with the necessary permissions, could remove the delete marker and access the data.

When a delete request is issued against a versioned bucket on a particular object, Amazon S3 still retains the data, but it removes access for users to retrieve that data.

Versioning-enabled buckets let you recover objects from accidental deletion or overwrite. Your bucket’s versioning configuration can also be MFA Delete–enabled for an additional layer of security. MFA Delete is discussed later in this chapter.

If you overwrite an object, it results in a new object version in the bucket. You can always restore from any previous versions.

In one bucket, for example, you can have two objects with the same key, but different version IDs, such as photo.gif (version 111111) and photo.gif (version 121212). This is illustrated in Figure 3.5.


Figure 3.5 Amazon S3 object version IDs

Later in this chapter, we will cover lifecycle policies. You can use lifecycle policies in combination with versioning and apply rules based on whether an object is the current or a previous version. If you are concerned about accumulating many versions and consuming space for a particular object, configure a lifecycle policy that deletes old versions of the object after a certain period of time.

Symbol of Tip It is easy to set up a lifecycle policy to control the amount of data that’s being retained when you use versioning on a bucket.
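
For example, the following Boto3 sketch (the bucket name, rule ID, and retention period are placeholders) expires noncurrent object versions 30 days after they are superseded:

import boto3

s3 = boto3.client('s3')

# Expire previous (noncurrent) versions 30 days after they are superseded.
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-old-versions',
                'Filter': {'Prefix': ''},            # apply to the whole bucket
                'Status': 'Enabled',
                'NoncurrentVersionExpiration': {'NoncurrentDays': 30}
            }
        ]
    }
)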

If you need to discontinue versioning on a bucket, copy all of your objects to a new bucket that has versioning disabled and use that bucket going forward.

Symbol of Tip Once you enable versioning on a bucket, it can never return to an unversioned state. You can, however, suspend versioning on that bucket.

It is important to be aware of the cost implications of a versioning-enabled bucket. When calculating cost for your bucket, you must treat every version as a completely separate object that takes up the same space as the object itself. As you can probably guess, this can become cost prohibitive for large media files or for objects that are updated frequently.

Region

Amazon S3 creates buckets in a region that you specify. You can choose any AWS Region that is geographically close to you to optimize latency, minimize costs, or address regulatory requirements.

Symbol of Tip Objects belonging to a bucket that you create in a specific AWS Region never leave that region unless you explicitly transfer them to another region.

Operations on Buckets

There are a number of different operations (API calls) that you can perform on Amazon S3 buckets. We will summarize a few of the most basic operations in this section. For more comprehensive information on all of the different operations that you can perform, refer to the Amazon S3 API Reference document available in the AWS Documentation repository. In this section, we show you how to create a bucket, list buckets, and delete a bucket.

Create a Bucket

This sample Python code shows how to create a bucket:

import boto3

s3 = boto3.client('s3')
s3.create_bucket(Bucket='my-bucket')
List Buckets

This sample Python code demonstrates getting a list of all of the bucket names available:

import boto3

# Create an S3 client
s3 = boto3.client('s3')

# Call S3 to list current buckets
response = s3.list_buckets()

# Get a list of all bucket names from the response
buckets = [bucket['Name'] for bucket in response['Buckets']]

# Print out the bucket list
print("Bucket List: %s" % buckets)
Delete a Bucket

The following sample Java code deletes an object from a nonversioned bucket and then deletes the bucket itself. Buckets must be empty before you can delete them, unless you use a force parameter with the AWS CLI (shown in the next section).

import java.io.IOException;

import com.amazonaws.AmazonServiceException;
import com.amazonaws.SdkClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.DeleteObjectRequest;

public class DeleteObjectNonVersionedBucket {

    public static void main(String[] args) throws IOException {
        String clientRegion = "*** Client region ***";
        String bucketName = "*** Bucket name ***";
        String keyName = "*** Key name ****";

        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withCredentials(new ProfileCredentialsProvider())
                    .withRegion(clientRegion)
                    .build();

            s3Client.deleteObject(new DeleteObjectRequest(bucketName, keyName));

            // A bucket must be empty before it can be deleted.
            s3Client.deleteBucket(bucketName);
        }
        catch(AmazonServiceException e) {
            // The call was transmitted successfully, but Amazon S3 couldn't process
            // it, so it returned an error response.
            e.printStackTrace();
        }
        catch(SdkClientException e) {
            // Amazon S3 couldn't be contacted for a response, or the client
            // couldn't parse the response from Amazon S3.
            e.printStackTrace();
        }
    }
}
AWS Command Line Interface

The following is a sample AWS Command Line Interface (AWS CLI) command that deletes a bucket, using the --force parameter to remove a nonempty bucket. This command deletes all objects first and then deletes the bucket.

$ aws s3 rb s3://bucket-name --force

Objects

You can store an unlimited number of objects in Amazon S3, but an individual object can range from 1 byte to 5 TB in size. If you have files larger than 5 TB, use a file splitter and upload the pieces to Amazon S3 as separate objects; reassemble them after downloading the parts for later use.

The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, you should consider using multipart upload (discussed later in this chapter). For any objects larger than 5 GB, you must use multipart upload.
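
As a sketch, the Boto3 managed transfer utility switches to multipart upload automatically once a file crosses a configurable threshold; the file name, bucket name, and threshold values below are placeholders.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Use multipart upload for files larger than 100 MB, with parallel part uploads.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # 100 MB
    multipart_chunksize=25 * 1024 * 1024,    # 25 MB parts
    max_concurrency=10
)

s3.upload_file('large-video.mp4', 'my-bucket', 'videos/large-video.mp4', Config=config)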

Object Facets

An object consists of the following facets:

Key The key is the name that you assign to an object, which may include a simulated folder structure. Each key is unique within a bucket, although a versioning-enabled bucket can hold multiple versions of an object under the same key.

Amazon S3 URLs can be thought of as a basic data map between “bucket + key + version” and the web service endpoint. For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, doc is the name of the bucket and 2006-03-01/AmazonS3.wsdl is the key.

Version ID Within a bucket, a key and version ID uniquely identify an object. If versioning is turned off, you have only a single version. If versioning is turned on, you may have multiple versions of a stored object.

Value The value is the actual content that you are storing. An object value can be any sequence of bytes, and objects can range in size from 1 byte up to 5 TB.

Metadata Metadata is a set of name-value pairs with which you can store information regarding the object. You can assign metadata, referred to as user-defined metadata, to your objects in Amazon S3. Amazon S3 also assigns system metadata to these objects, which it uses for managing objects.

Subresources Amazon S3 uses the subresource mechanism to store additional object-specific information. Because subresources are subordinates to objects, they are always associated with some other entity such as an object or a bucket. The subresources associated with Amazon S3 objects can include the following:

  • Access control list (ACL) A list of grants identifying the grantees and the permissions granted.
  • Torrent Returns the torrent file associated with the specific object.

Access Control Information You can control access to the objects you store in Amazon S3. Amazon S3 supports both resource-based access control, such as an ACL and bucket policies, and user-based access control.

Object Tagging

Object tagging enables you to categorize storage. Each tag is a key-value pair. Consider the following tagging examples.

Suppose an object contains protected health information (PHI) data. You can tag the object using the following key-value pair:

PHI=True

or

Classification=PHI

Symbol of Warning While it is acceptable to use tags to label objects containing confidential data (such as personally identifiable information (PII) or PHI), the tags themselves should not contain any confidential information.

Suppose that you store project files in your Amazon S3 bucket. You can tag these objects with a key called Project and a value, as shown here:

Project=Blue

You can add multiple tags to a single object, such as the following:

Project=SalesForecast2018
Classification=confidential

You can tag new objects when you upload them, or you can add them to existing objects.
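
A minimal Boto3 sketch of both approaches follows (the bucket, keys, file, and tag values are placeholders):

import boto3

s3 = boto3.client('s3')

# Tag an object at upload time (tags are passed as a URL-encoded string).
s3.put_object(
    Bucket='my-bucket',
    Key='project/projectx/document.pdf',
    Body=open('document.pdf', 'rb'),
    Tagging='Project=SalesForecast2018&Classification=confidential'
)

# Add or replace tags on an existing object.
s3.put_object_tagging(
    Bucket='my-bucket',
    Key='photos/photo1.jpg',
    Tagging={'TagSet': [{'Key': 'Project', 'Value': 'Blue'}]}
)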

Note the following limitations when working with tagging:

  • You can associate up to 10 tags with an object, and each tag associated with an object must have a unique tag key.
  • A tag key can be up to 128 Unicode characters in length, and tag values can be up to 256 Unicode characters in length.
  • Keys and values are case sensitive.

Developers commonly categorize their files in a folder-like structure in the key name (remember, Amazon S3 has a flat file structure), such as the following:

photos/photo1.jpg
project/projectx/document.pdf
project/projecty/document2.pdf

This allows you to have only one-dimensional categorization, meaning that everything under a prefix is one category.

With tagging, you now have another dimension. If your photo1 is in project x category, tag the object accordingly. In addition to data classification, tagging offers the following benefits:

  • Object tags enable fine-grained access control of permissions. For example, you could grant an AWS Identity and Access Management (IAM) user permissions to read-only objects with specific tags.
  • Object tags enable fine-grained object lifecycle management in which you can specify a tag-based filter, in addition to key name prefix, in a lifecycle rule.
  • When using Amazon S3 analytics, you can configure filters to group objects together for analysis by object tags, by key name prefix, or by both prefix and tags.
  • You can also customize Amazon CloudWatch metrics to display information by specific tag filters. The following sections provide details.

Cross-Origin Resource Sharing

Cross-Origin Resource Sharing (CORS) defines a way for client web applications that are loaded in one domain to interact with resources in a different domain. With CORS support in Amazon S3, you can build client-side web applications with Amazon S3 and selectively allow cross-origin access to your Amazon S3 resources while avoiding the need to use a proxy.

Cross-Origin Request Scenario

Suppose that you are hosting a website in an Amazon S3 bucket named website. Your users load the website endpoint: http://website.s3-website-us-east-1.amazonaws.com.

The web pages stored in this bucket use JavaScript to make authenticated GET and PUT requests against the same bucket through the Amazon S3 API endpoint for the bucket: website.s3.amazonaws.com.

A browser would normally block JavaScript from allowing those requests, but with CORS, you can configure your bucket to enable cross-origin requests explicitly from website.s3-website-us-east-1.amazonaws.com.
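
A Boto3 sketch of such a CORS configuration might look like the following; the bucket name and allowed origin are placeholders taken from this scenario.

import boto3

s3 = boto3.client('s3')

# Allow the website origin to make GET and PUT requests against the bucket.
s3.put_bucket_cors(
    Bucket='website',
    CORSConfiguration={
        'CORSRules': [
            {
                'AllowedOrigins': ['http://website.s3-website-us-east-1.amazonaws.com'],
                'AllowedMethods': ['GET', 'PUT'],
                'AllowedHeaders': ['*'],
                'MaxAgeSeconds': 3000
            }
        ]
    }
)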

Symbol of Note Suppose that you host a web font from your Amazon S3 bucket. Browsers require a CORS check (also referred as a preflight check) for loading web fonts, so you would configure the bucket that is hosting the web font to allow any origin to make these requests.

Symbol of Tip There are no coding exercises as part of the exam, but these case studies can help you visualize how to use Amazon S3 and CORS.

Operations on Objects

There are a number of different operations (API calls) that you can perform on Amazon S3 buckets. We will summarize a few of the most basic operations in this section. For more comprehensive information on all of the different operations that you can perform, refer to the Amazon S3 API Reference document available in the AWS Documentation repository.

Write an Object

This sample Python code shows how to add an object to a bucket:

import boto3

# Create an S3 client
s3 = boto3.client('s3')

filename = 'file.txt'
bucket_name = 'my-bucket'

# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
Reading Objects

The following Java code example demonstrates getting a stream on the object data of a particular object and processing the response:

AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
S3Object object = s3Client.getObject(
                  new GetObjectRequest(bucketName, key));
InputStream objectData = object.getObjectContent();
// Process the objectData stream.
objectData.close();
Deleting Objects

You can delete one or more objects directly from Amazon S3. You have the following options when deleting an object:

Delete a Single Object Amazon S3 provides the DELETE API to delete one object in a single HTTP request.

Delete Multiple Objects Amazon S3 also provides the Multi-Object Delete API to delete up to 1,000 objects in a single HTTP request.

The following Java sample demonstrates deleting an object by providing the bucket name and key name:

AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
s3client.deleteObject(new DeleteObjectRequest(bucketName, keyName));

This next Java sample demonstrates deleting a versioned object by providing a bucket name, object key, and a version ID:

AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
s3client.deleteObject(new DeleteVersionRequest(bucketName, keyName, versionId));

List Keys The following Java code example lists object keys in a bucket:

private static String bucketName = "***bucket name***";

AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
System.out.println("Listing objects");

final ListObjectsV2Request req = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withMaxKeys(2);
ListObjectsV2Result result;

do {
    result = s3client.listObjectsV2(req);

    for (S3ObjectSummary objectSummary : result.getObjectSummaries()) {
        System.out.println(" - " + objectSummary.getKey() +
                "  (size = " + objectSummary.getSize() + ")");
    }

    System.out.println("Next Continuation Token : " + result.getNextContinuationToken());
    req.setContinuationToken(result.getNextContinuationToken());
} while (result.isTruncated());

Storage Classes

There are several different storage classes from which to choose when using Amazon S3. Your choice will depend on your level of need for durability, availability, and performance for your application.

Amazon S3 Standard

Amazon S3 Standard offers high-durability, high-availability, and performance-object storage for frequently accessed data. Because it delivers low latency and high throughput, Amazon S3 Standard is ideal for a wide variety of use cases, including the following:

  • Cloud applications
  • Dynamic websites
  • Content distribution
  • Mobile and gaming applications
  • Big data analytics

Amazon S3 Standard is designed to achieve durability of 99.999999999 percent of objects (designed to sustain the loss of data in two facilities) and availability of 99.99 percent over a given year (which is backed by the Amazon S3 Service Level Agreement).

Essentially, the data in Amazon S3 is spread out over multiple facilities within a region. You can lose access to two facilities and still have access to your files.

Reduced Redundancy Storage

Reduced Redundancy Storage (RRS) (or Reduced_Redundancy) is an Amazon S3 storage option that enables customers to store noncritical, reproducible data at lower levels of redundancy than Amazon S3 Standard storage. It provides a highly available solution for distributing or sharing content that is durably stored elsewhere or for objects that can easily be regenerated, such as thumbnails or transcoded media.

The RRS option stores objects on multiple devices across multiple facilities, providing 400 times the durability of a typical disk drive, but it does not replicate objects as many times as Amazon S3 Standard storage.

RRS is designed to achieve availability of 99.99 percent (same as Amazon S3 Standard) and durability of 99.99 percent (designed to sustain the loss of data in a single facility).

Amazon S3 Standard-Infrequent Access

Amazon S3 Standard-Infrequent Access (Standard_IA) is an Amazon S3 storage class for data that is accessed less frequently but requires rapid access when needed. It offers the same high durability, throughput, and low latency of Amazon S3 Standard, but it has a lower per-gigabyte storage price and per-gigabyte retrieval fee.

The ideal use cases for using Standard_IA include the following:

  • Long-term storage
  • Backups
  • Data stores for disaster recovery

Standard_IA is set at the object level and can exist in the same bucket as Amazon S3 Standard, allowing you to use lifecycle policies to transition objects automatically between storage classes without any application changes.

Standard_IA is designed to achieve availability of 99.9 percent (with low retrieval latency) and durability of 99.999999999 percent of objects over a given year (same as Amazon S3 Standard).
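
As a sketch, you can place an object directly into Standard_IA at upload time by setting the storage class; the bucket, key, and file name below are placeholders.

import boto3

s3 = boto3.client('s3')

# Upload an object directly into the Standard-Infrequent Access storage class.
s3.put_object(
    Bucket='my-bucket',
    Key='backups/2018-06-30.tar.gz',
    Body=open('2018-06-30.tar.gz', 'rb'),
    StorageClass='STANDARD_IA'
)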

Amazon S3 One Zone-Infrequent Access

Amazon S3 One Zone-Infrequent Access (OneZone_IA) is similar to Amazon S3 Standard-IA. The difference is that the data is stored only in a single Availability Zone instead of a minimum of three Availability Zones. Because of this, storing data in OneZone_IA costs 20 percent less than storing it in Standard_IA. Because of this approach, however, any data stored in this storage class will be permanently lost in the event of an Availability Zone destruction.

Amazon Simple Storage Service Glacier

Amazon Simple Storage Service Glacier (Amazon S3 Glacier) is a secure, durable, and extremely low-cost storage service for data archiving that offers the same high durability as Amazon S3. Unlike Amazon S3 Standard’s immediate retrieval times, Amazon S3 Glacier’s retrieval times run from a few minutes to several hours.

To keep costs low, Amazon S3 Glacier provides three archive access speeds, ranging from minutes to hours. This allows you to choose an option that will meet your recovery time objective (RTO) for backups in your disaster recovery plan.

Amazon S3 Glacier can also be used to secure archives that must be kept to satisfy a compliance policy. For example, you may need to keep certain records for seven years before deletion and only need access to them during an audit. Amazon S3 Glacier lets you retrieve those archives when audits do occur, at an extremely low storage cost in exchange for slower access.

Vaults

Amazon S3 Glacier uses vaults as containers to store archives. You can view a list of your vaults in the AWS Management Console and use the AWS software development kits (SDKs) to perform a variety of vault operations, such as the following:

  • Create vault
  • Delete vault
  • Lock vault
  • List vault metadata
  • Retrieve vault inventory
  • Tag vaults for filtering
  • Configure vault notifications

You can also set access policies for each vault to grant or deny specific activities to users. You can have up to 1,000 vaults per AWS account per AWS Region.

Amazon S3 Glacier provides a management console to create and delete vaults. All other interactions with Amazon S3 Glacier, however, require that you use the AWS CLI or write code.
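
As a minimal sketch of what that code looks like with the AWS SDK for Java (the vault name examplevault is a placeholder), you can create a vault and then read back its metadata as follows:

import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.AmazonGlacierClientBuilder;
import com.amazonaws.services.glacier.model.CreateVaultRequest;
import com.amazonaws.services.glacier.model.CreateVaultResult;
import com.amazonaws.services.glacier.model.DescribeVaultRequest;
import com.amazonaws.services.glacier.model.DescribeVaultResult;

public class CreateVaultExample {
    public static void main(String[] args) {
        AmazonGlacier glacier = AmazonGlacierClientBuilder.standard().build();

        // Create the vault in the client's configured region.
        CreateVaultResult created = glacier.createVault(
                new CreateVaultRequest().withVaultName("examplevault"));
        System.out.println("Vault created at: " + created.getLocation());

        // Retrieve the vault metadata (size, number of archives, last inventory date).
        DescribeVaultResult metadata = glacier.describeVault(
                new DescribeVaultRequest().withVaultName("examplevault"));
        System.out.println("Archives in vault: " + metadata.getNumberOfArchives());
    }
}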

Vault Lock

Amazon S3 Glacier Vault Lock allows you to deploy and enforce compliance controls easily on individual Amazon S3 Glacier vaults via a lockable policy. You can specify controls such as Write Once Read Many (WORM) in a Vault Lock policy and lock the policy from future edits. Once locked, the policy becomes immutable, and Amazon S3 Glacier will enforce the prescribed controls to help achieve your compliance objectives.

Once you initiate a lock, you have 24 hours to validate your lock policy to ensure that it is working as you intended. Until that 24 hours is up, you can abort the lock and make changes. After 24 hours, that Vault Lock is permanent, and you will not be able to change it.
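 
The lock workflow maps to three SDK calls: initiate the lock to receive a lock ID, test against the in-progress policy, and then either complete or abort the lock within the 24-hour window. The following is a rough sketch using the AWS SDK for Java; the vault name, account number, and WORM-style policy JSON are hypothetical placeholders:

import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.AmazonGlacierClientBuilder;
import com.amazonaws.services.glacier.model.AbortVaultLockRequest;
import com.amazonaws.services.glacier.model.CompleteVaultLockRequest;
import com.amazonaws.services.glacier.model.InitiateVaultLockRequest;
import com.amazonaws.services.glacier.model.VaultLockPolicy;

AmazonGlacier glacier = AmazonGlacierClientBuilder.standard().build();

// A hypothetical WORM-style policy that denies archive deletion.
String lockPolicyJson = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"deny-delete\","
        + "\"Principal\":\"*\",\"Effect\":\"Deny\",\"Action\":\"glacier:DeleteArchive\","
        + "\"Resource\":\"arn:aws:glacier:us-east-1:111122223333:vaults/examplevault\"}]}";

// Step 1: initiate the lock; the policy is attached in the InProgress state.
String lockId = glacier.initiateVaultLock(new InitiateVaultLockRequest()
        .withAccountId("-")   // "-" means the account that owns the credentials
        .withVaultName("examplevault")
        .withPolicy(new VaultLockPolicy().withPolicy(lockPolicyJson)))
        .getLockId();

// Step 2 (within 24 hours): either make the lock permanent...
glacier.completeVaultLock(new CompleteVaultLockRequest()
        .withAccountId("-")
        .withVaultName("examplevault")
        .withLockId(lockId));

// ...or abort it and start over if the policy did not behave as intended.
// glacier.abortVaultLock(new AbortVaultLockRequest()
//         .withAccountId("-").withVaultName("examplevault"));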

Archives

An archive is any object, such as a photo, video, or document that you store in a vault. It is a base unit of storage in Amazon S3 Glacier. Each archive has a unique ID and optional description. When you upload an archive, Amazon S3 Glacier returns a response that includes an archive ID. This archive ID is unique in the region in which the archive is stored. You can retrieve an archive using its ID, but not its description.

To upload archives into your vaults, you must either use the AWS CLI or write code to make requests, using either the REST API directly or the AWS SDKs.
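
A minimal sketch of an upload using the high-level ArchiveTransferManager from the AWS SDK for Java follows; the vault name, description, and file path are placeholders:

import java.io.File;

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.glacier.AmazonGlacierClient;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManager;
import com.amazonaws.services.glacier.transfer.UploadResult;

ProfileCredentialsProvider credentials = new ProfileCredentialsProvider();
AmazonGlacierClient glacier = new AmazonGlacierClient(credentials);
glacier.setEndpoint("https://glacier.us-east-1.amazonaws.com/");

// The high-level API handles multipart uploads and checksums for you.
ArchiveTransferManager atm = new ArchiveTransferManager(glacier, credentials);
UploadResult result = atm.upload("examplevault", "2019 tax documents",
        new File("/backups/tax-2019.zip"));

// Store this ID client side; you need it to retrieve or delete the archive later.
System.out.println("Archive ID: " + result.getArchiveId());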

Maintaining Client-Side Archive Metadata

Except for the optional archive description, Amazon S3 Glacier does not support any additional metadata for the archives. When you upload an archive, Amazon S3 Glacier assigns an ID—an opaque sequence of characters—from which you cannot infer any meaning about the archive. Metadata about the archives can be maintained on the client side. The metadata can include identifying archive information such as the archive name.

Symbol of Tip If you use Amazon S3, when you upload an object to a bucket, you can assign the object an object key such as MyDocument.txt or SomePhoto.jpg. In Amazon S3 Glacier, you cannot assign a key name to the archives you upload.

If you maintain client-side archive metadata, note that Amazon S3 Glacier maintains a vault inventory that includes archive IDs and any descriptions that you provided during the archive upload. We recommend that you occasionally download the vault inventory to reconcile any issues in the client-side database that you maintain for the archive metadata. Amazon S3 Glacier takes vault inventory approximately once a day. When you request a vault inventory, Amazon S3 Glacier returns the last inventory it prepared, which is a point-in-time snapshot.

Using the AWS SDKs with Amazon S3 Glacier

AWS provides SDKs for you to develop applications for Amazon S3 Glacier in various programming languages.

The AWS SDKs for Java and .NET offer both high-level and low-level wrapper libraries. These libraries wrap the underlying Amazon S3 Glacier REST API and simplify your programming tasks. The low-level wrapper libraries map closely to the underlying REST API, whereas the high-level libraries provide a further abstraction for some of the operations. For example, when uploading an archive using the low-level API, you must compute and provide a checksum of the payload yourself, whereas the high-level API computes the checksum for you.

Encryption

All data in Amazon S3 Glacier is encrypted on the server side using AES-256; Amazon S3 Glacier handles key management and key protection for you. If you want, you can also manage your own keys and encrypt the data prior to uploading.

Restoring Objects from Amazon S3 Glacier

Objects in the Amazon S3 Glacier storage class are not immediately accessible. Once objects have been transitioned to Amazon S3 Glacier, you must first restore a temporary copy of them before you can retrieve them.

Remember that Amazon S3 Glacier charges a retrieval fee for retrieving objects. When you restore an archive, you pay for both the archive and the restored copy. Because there is a storage cost for the copy, restore objects only for the duration that you need them. If you need a permanent copy of the object, create a copy of it in your Amazon S3 bucket.

Archive Retrieval Options

There are several different options for restoring archived objects from Amazon S3 Glacier to Amazon S3, as shown in Table 3.4.

Table 3.4 Amazon S3 Glacier Archive Retrieval Options

Retrieval Option Retrieval Time Note
Expedited retrieval (On-Demand) 1–5 minutes Processed immediately the vast majority of the time. During periods of high demand, the request may fail to process, and you will be required to repeat it.
Expedited retrieval (Provisioned) 1–5 minutes Guaranteed to process immediately. After you purchase provisioned capacity, all of your expedited retrievals are processed in this manner.
Standard retrieval 3–5 hours
Bulk retrieval 5–12 hours Lowest-cost option

Symbol of Tip Do not use Amazon S3 Glacier for backups if your RTO is shorter than the lowest Amazon S3 Glacier retrieval time for your chosen retrieval option. For example, if your RTO requires data retrieval of two hours in a disaster recovery scenario, then Amazon S3 Glacier standard retrieval will not meet your RTO.
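
The following sketch shows what a restore request looks like with the AWS SDK for Java, including selecting a retrieval tier; the bucket name, key, and retention period are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GlacierJobParameters;
import com.amazonaws.services.s3.model.RestoreObjectRequest;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Restore a temporary copy for 7 days using the Standard retrieval tier (3-5 hours).
RestoreObjectRequest restoreRequest =
        new RestoreObjectRequest("examplebucket", "archives/2015-orders.csv", 7)
                .withGlacierJobParameters(new GlacierJobParameters().withTier("Standard"));
s3.restoreObjectV2(restoreRequest);

// Once the restore completes, GET the object as usual for the number of days specified.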

Storage Class Comparison

Table 3.5 shows a comparison of the Amazon S3 storage classes. This is an important table for the certification exam. Many storage decision questions on the exam center on the level of durability, availability, and cost. The table’s comparisons can help you make the right choice for a question, in addition to understanding trade-offs when choosing a data store for an application.

Table 3.5 Amazon S3 Storage Class Comparison

  Standard Standard_IA OneZone_IA Amazon S3 Glacier
Designed for durability 99.999999999% 99.999999999% 99.999999999%* 99.999999999%
Designed for availability 99.99% 99.9% 99.5% N/A
Availability SLA 99.9% 99% 99% N/A
Availability Zones ≥3 ≥3 1 ≥3
Minimum capacity charge per object N/A 128 KB* 128 KB* N/A
Minimum storage duration charge N/A 30 days 30 days 90 days
Retrieval fee N/A Per GB retrieved* Per GB retrieved* Per GB retrieved*
First byte latency Milliseconds Milliseconds Milliseconds Minutes or hours*
Storage type Object Object Object Object
Lifecycle transitions Yes Yes Yes Yes
* Because OneZone_IA stores data in a single Availability Zone, data stored in this storage class will be lost in the event of Availability Zone destruction. Standard_IA and OneZone_IA have a minimum billable object size of 128 KB; smaller objects are charged for 128 KB of storage. Amazon S3 Glacier allows you to select from multiple retrieval tiers based upon your needs.

Data Consistency Model

When deciding whether to choose Amazon S3 or Amazon EBS for your application, one important aspect to consider is the consistency model of the storage option. Amazon EBS provides read-after-write consistency for all operations, whereas Amazon S3 provides read-after-write consistency only for PUTs of new objects.

Amazon S3 offers eventual consistency for overwrite PUTs and DELETEs in all regions, and updates to a single key are atomic. For example, if you PUT an object to update an existing object and immediately attempt to read that object, you may read either the old data or the new data.

For PUT operations with new objects not yet in Amazon S3, you will experience read-after-write consistency. For PUT updates when you are overwriting an existing file or DELETE operations, you will experience eventual consistency.

Symbol of Warning Amazon S3 does not currently support object locking. If two PUT requests are simultaneously made to the same key, the request with the latest time stamp wins. If this is an issue, you will be required to build an object locking mechanism into your application.

You may be wondering why Amazon S3 was designed with this style of consistency. The consistency, availability, and partition tolerance theorem (CAP theorem) states that a distributed data store can fully deliver only two of the three properties for a particular storage design. The CAP theorem is shown in Figure 3.6.

The figure shows a Venn diagram illustrating the consistency, availability, and partition tolerance theorem (CAP theorem). First circle is labeled “consistency,” second circle is labeled “availability” and third circle is labeled “partition tolerance” (anticlockwise direction). The overlapped portions are labeled “C + A,” “C + P” and “A + P.”

Figure 3.6 CAP theorem

Think of partition tolerance in this equation as the storage durability. Amazon S3 was designed for high availability and high durability (multiple copies across multiple facilities), so the design trade-off is the consistency. When you PUT an object, you are not only putting the object into one location but into three, meaning that there is either a slightly increased latency on the read-after-write consistency of a PUT or eventual consistency on the PUT update or DELETE operations while Amazon S3 reconciles all copies. You do not know, for instance, which facility a file is coming from when you GET an object. If you had recently written an object, it may have propagated to only two facilities, so when you try to read the object right after your PUT, you may receive the old object or the new object.

Concurrent Applications

As a developer, it is critical to consider the way your application works and the consistency needs of your files. If your application requires read-after-write consistency on all operations, then Amazon S3 is not going to be the right choice for that application. If you are working with concurrent applications, it is important to know how your application performs PUT, GET, and DELETE operations concurrently so that you can determine whether eventual consistency is acceptable for your application.

In Figure 3.7, both W1 (write 1) and W2 (write 2) complete before the start of R1 (read 1) and R2 (read 2). For a consistent read, R1 and R2 both return color = ruby. For an eventually consistent read, R1 and R2 might return color = red, color = ruby, or no results, depending on the amount of time that has elapsed.

The figure shows consistency example 1.

Figure 3.7 Consistency example 1

In Figure 3.8, W2 does not complete before the start of R1. Therefore, R1 might return color = ruby or color = garnet for either a consistent read or an eventually consistent read. Depending on the amount of time that has elapsed, an eventually consistent read might also return no results.

The figure shows consistency example 2.

Figure 3.8 Consistency example 2

For a consistent read, R2 returns color = garnet. For an eventually consistent read, R2 might return color = ruby, color = garnet, or no results depending on the amount of time that has elapsed.

In Figure 3.9, client 2 performs W2 before Amazon S3 returns a success for W1, so the outcome of the final value is unknown (color = garnet or color = brick). Any subsequent reads (consistent read or eventually consistent) might return either value. Depending on the amount of time that has elapsed, an eventually consistent read might also return no results.

The figure shows consistency example 3.

Figure 3.9 Consistency example 3

Symbol of Tip If you need a strongly consistent data store, choose a different data store than Amazon S3 or code consistency checks into your application.

Presigned URLs

A presigned URL is a way to grant access to an object. One way that developers use presigned URLs is to allow users to upload or download objects without granting them direct access to Amazon S3 or the account.

For example, if you need to send a document hosted in an Amazon S3 bucket to an external reviewer who is outside of your organization, you do not want to grant them access using IAM to your bucket or objects. Instead, generate a presigned URL to the object and send that to the user to download your file.

Another example is if you need someone external to your organization to upload a file. Maybe a media company is designing the graphics for the website you are developing. You can create a presigned URL for them to upload their artifacts directly to Amazon S3 without granting them access to your Amazon S3 bucket or account.

Anyone with valid security credentials can create a presigned URL. For you to upload an object successfully, however, the presigned URL must be created by someone who has permission to perform the operation upon which the presigned URL is based.

The following Java code example demonstrates generating a presigned URL:

AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());

java.util.Date expiration = new java.util.Date();
long msec = expiration.getTime();
msec += 1000 * 60 * 60; // Add 1 hour.
expiration.setTime(msec);

GeneratePresignedUrlRequest generatePresignedUrlRequest = new GeneratePresignedUrlRequest(bucketName, objectKey);
generatePresignedUrlRequest.setMethod(HttpMethod.PUT);
generatePresignedUrlRequest.setExpiration(expiration);

URL url = s3Client.generatePresignedUrl(generatePresignedUrlRequest);

// Use the pre-signed URL to upload an object.

Symbol of Tip Amazon S3 presigned URLs cannot be generated within the AWS Management Console, but they can be generated using the AWS CLI or AWS SDKs.

Encryption

Data protection refers to protecting data while in transit (as it travels to and from Amazon S3) and at rest (while it is stored on Amazon S3 infrastructure). As a best practice, all sensitive data stored in Amazon S3 should be encrypted, both at rest and in transit.

You can protect data in transit by using Amazon S3 SSL API endpoints, which ensures that all data sent to and from Amazon S3 is encrypted using the HTTPS protocol while in transit.

For data at rest in Amazon S3, you can encrypt it using different options of Server-Side Encryption (SSE). Your objects in Amazon S3 are encrypted with AES-256 at the object level as they are written to disk in AWS data centers and are then decrypted for you when you access them.

You can also use client-side encryption, with which you encrypt the objects before uploading to Amazon S3 and then decrypt them after you have downloaded them. Some customers, for some workloads, will use a combination of both server-side and client-side encryption for extra protection.

Envelope Encryption Concepts

Before examining the different types of encryption available, we will review envelope encryption, which several AWS services use to provide a balance between performance and security.

The following steps describe how envelope encryption works:

  1. A data key is generated by the AWS service at the time you request your data to be encrypted, as shown in Figure 3.10.
    The figure shows how to generate a data key.

    Figure 3.10 Generating a data key

  2. The data key generated in step 1 is used to encrypt your data, as shown in Figure 3.11.
    The figure shows how to encrypt the data.

    Figure 3.11 Encrypting the data

  3. The data key is then encrypted with a key-encrypting key unique to the service storing your data, as shown in Figure 3.12.
    The figure shows how to encrypt the data key.

    Figure 3.12 Encrypted data key

  4. The encrypted data key and the encrypted data are then stored by the AWS storage service (such as Amazon S3 or Amazon EBS) on your behalf. This is shown in Figure 3.13.
The figure shows how to encrypt the data and key storage.

Figure 3.13 Encrypted data and key storage

When you need access to your plain-text data, this process is reversed. The encrypted data key is decrypted using the key-encrypting key, and the data key is then used to decrypt your data.
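
To make the flow concrete, the following is a minimal sketch of envelope encryption using AWS KMS with the AWS SDK for Java. The CMK alias is hypothetical, and the step of actually encrypting your data with the plaintext key (for example, with the Java Cryptography Extension) is omitted:

import java.nio.ByteBuffer;

import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.AWSKMSClientBuilder;
import com.amazonaws.services.kms.model.DataKeySpec;
import com.amazonaws.services.kms.model.DecryptRequest;
import com.amazonaws.services.kms.model.GenerateDataKeyRequest;
import com.amazonaws.services.kms.model.GenerateDataKeyResult;

AWSKMS kms = AWSKMSClientBuilder.defaultClient();

// Step 1: request a data key under a key-encrypting key (a KMS CMK).
GenerateDataKeyResult dataKey = kms.generateDataKey(new GenerateDataKeyRequest()
        .withKeyId("alias/my-application-key")   // hypothetical CMK alias
        .withKeySpec(DataKeySpec.AES_256));

// Step 2: use the plaintext key to encrypt your data locally, then discard it.
ByteBuffer plaintextKey = dataKey.getPlaintext();

// Steps 3 and 4: store the encrypted data key alongside the encrypted data.
ByteBuffer encryptedKey = dataKey.getCiphertextBlob();

// To decrypt later, send the encrypted data key back to KMS to recover the plaintext key.
ByteBuffer recoveredKey = kms.decrypt(new DecryptRequest()
        .withCiphertextBlob(encryptedKey)).getPlaintext();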

Symbol of Tip The important point to remember regarding envelope encryption is that the key-encrypting keys used to encrypt data keys are stored and managed separately from the data and the data keys.

Server-Side Encryption (SSE)

You have three mutually exclusive options for how you choose to manage your encryption keys when using SSE with Amazon S3.

SSE-S3 (Amazon S3 managed keys) You can set an API flag or check a box in the AWS Management Console to have data encrypted before it is written to disk in Amazon S3. Each object is encrypted with a unique data key. As an additional safeguard, this key is encrypted with a periodically-rotated master key managed by Amazon S3. AES-256 is used for both object and master keys. This feature is offered at no additional cost beyond what you pay for using Amazon S3.

SSE-C (Customer-provided keys) You can use your own encryption key while uploading an object to Amazon S3. This encryption key is used by Amazon S3 to encrypt your data using AES-256. After the object is encrypted, the encryption key you supplied is deleted from the Amazon S3 system that used it to encrypt your data. When you retrieve this object from Amazon S3, you must provide the same encryption key in your request. Amazon S3 verifies that the encryption key matches, decrypts the object, and returns the object to you. This feature is also offered at no additional cost beyond what you pay for using Amazon S3.

SSE-KMS (AWS KMS managed encryption keys) You can encrypt your data in Amazon S3 by defining an AWS KMS master key within your account to encrypt the unique object key (referred to as a data key) that will ultimately encrypt your object. When you upload your object, a request is sent to AWS KMS to create an object key. AWS KMS generates this object key and encrypts it using the master key that you specified earlier. Then, AWS KMS returns this encrypted object key along with the plaintext object key to Amazon S3. The Amazon S3 web server encrypts your object using the plaintext object key, stores the now-encrypted object (along with the encrypted object key), and deletes the plaintext object key from memory.

To retrieve this encrypted object, Amazon S3 sends the encrypted object key to AWS KMS, which then decrypts the object key using the correct master key and returns the decrypted (plaintext) object key to Amazon S3. With the plaintext object key, Amazon S3 decrypts the encrypted object and returns it to you. Unlike SSE-S3 and SSE-C, using SSE-KMS does incur an additional charge. Refer to the AWS KMS pricing page on the AWS website for more information.
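
As a sketch of how these options appear in code with the AWS SDK for Java (the bucket name, object keys, local file, and KMS key ID below are placeholders), SSE-S3 is requested through the object metadata and SSE-KMS through SSEAwsKeyManagementParams:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.SSEAwsKeyManagementParams;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// SSE-S3: Amazon S3 managed keys (AES-256).
ObjectMetadata metadata = new ObjectMetadata();
metadata.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION);
s3.putObject(new PutObjectRequest("examplebucket", "reports/sse-s3.txt",
        new File("report.txt")).withMetadata(metadata));

// SSE-KMS: encrypt under a specific AWS KMS master key that you control.
s3.putObject(new PutObjectRequest("examplebucket", "reports/sse-kms.txt",
        new File("report.txt"))
        .withSSEAwsKeyManagementParams(
                new SSEAwsKeyManagementParams("1234abcd-12ab-34cd-56ef-1234567890ab")));

For SSE-C, you would instead attach an SSECustomerKey that you supply on both the PutObjectRequest and the later GetObjectRequest.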

Symbol of Tip For maximum simplicity and ease of use, use SSE with AWS managed keys (SSE-S3 or SSE-KMS). Also, know the difference between SSE-S3, SSE-KMS, and SSE-C for SSE.

Client-Side Encryption

Client-side encryption refers to encrypting your data before sending it to Amazon S3. You have two options for using data encryption keys.

Client-Side Master Key

The first option is to use a client-side master key of your own. When uploading an object, you provide a client-side master key to the Amazon S3 encryption client (for example, AmazonS3EncryptionClient when using the AWS SDK for Java). The client uses this master key only to encrypt the data encryption key that it generates randomly. When downloading an object, the client first downloads the encrypted object from Amazon S3 along with the metadata. Using the material description in the metadata, the client determines which master key to use to decrypt the encrypted data key. Then the client uses that master key to decrypt the data key and uses the data key to decrypt the object. The client-side master key that you provide can be either a symmetric key or a public/private key pair.

The process works as follows:

  1. The Amazon S3 encryption client locally generates a one-time-use symmetric key (also known as a data encryption key or data key) and uses this data key to encrypt the data of a single Amazon S3 object (for each object, the client generates a separate data key).
  2. The client encrypts the data encryption key using the master key that you provide.
  3. The client uploads the encrypted data key and its material description as part of the object metadata. The material description helps the client later determine which client-side master key to use for decryption (when you download the object, the client decrypts it).
  4. The client then uploads the encrypted data to Amazon S3 and also saves the encrypted data key as object metadata (x-amz-meta-x-amz-key) in Amazon S3 by default.
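
A minimal sketch of this first option, assuming a symmetric client-side master key that you generate and manage yourself (key storage and error handling are omitted; the bucket name and file are placeholders):

import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3EncryptionClient;
import com.amazonaws.services.s3.model.EncryptionMaterials;
import com.amazonaws.services.s3.model.StaticEncryptionMaterialsProvider;

// Generate (or load) a 256-bit symmetric client-side master key.
// In practice, you must persist this key securely; losing it means losing the data.
KeyGenerator keyGenerator = KeyGenerator.getInstance("AES");
keyGenerator.init(256);
SecretKey clientSideMasterKey = keyGenerator.generateKey();

// The encryption client generates a one-time data key per object, encrypts the object
// locally, and stores the encrypted data key in the object metadata.
AmazonS3EncryptionClient encryptionClient = new AmazonS3EncryptionClient(
        new ProfileCredentialsProvider(),
        new StaticEncryptionMaterialsProvider(new EncryptionMaterials(clientSideMasterKey)));

encryptionClient.putObject("examplebucket", "client-side-encrypted.txt",
        new java.io.File("secret.txt"));
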
AWS KMS-Managed Customer Master Key (CMK)

The second option is to use an AWS KMS managed customer master key (CMK). This process is similar to the SSE-KMS process described earlier, except that the encryption and decryption happen on the client side before the object is uploaded and after it is downloaded, rather than on the Amazon S3 servers. The AWS SDK for Java provides an Amazon S3 encryption client for this purpose.

Using an AWS KMS Managed CMK (AWS SDK for Java)

import java.io.ByteArrayInputStream;
import java.util.Arrays;

import junit.framework.Assert;

import org.apache.commons.io.IOUtils;

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3EncryptionClient;
import com.amazonaws.services.s3.model.CryptoConfiguration;
import com.amazonaws.services.s3.model.KMSEncryptionMaterialsProvider;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class testKMSkeyUploadObject {

    private static AmazonS3EncryptionClient encryptionClient;

    public static void main(String[] args) throws Exception {
       String bucketName = "***bucket name***";
        String objectKey  = "ExampleKMSEncryptedObject";
        String kms_cmk_id = "***AWS KMS customer master key ID***";

        KMSEncryptionMaterialsProvider materialProvider =
                new KMSEncryptionMaterialsProvider(kms_cmk_id);

        encryptionClient = new AmazonS3EncryptionClient(new ProfileCredentialsProvider(), materialProvider,
                new CryptoConfiguration().withKmsRegion(Regions.US_EAST_1))
            .withRegion(Region.getRegion(Regions.US_EAST_1));

        // Upload object using the encryption client.

        byte[] plaintext = "Hello World, S3 Client-side Encryption Using Asymmetric Master Key!"
                .getBytes();
        System.out.println("plaintext's length: " + plaintext.length);
        encryptionClient.putObject(new PutObjectRequest(bucketName, objectKey,
                new ByteArrayInputStream(plaintext), new ObjectMetadata()));

     // Download the object.

        S3Object downloadedObject = encryptionClient.getObject(bucketName,
                objectKey);
        byte[] decrypted = IOUtils.toByteArray(downloadedObject
                .getObjectContent());

        // Verify same data.

        Assert.assertTrue(Arrays.equals(plaintext, decrypted));
    }
}

Symbol of Tip Know the difference between CMK and client-side master keys for client-side encryption.

Access Control

By default, all Amazon S3 resources—buckets, objects, and related sub-resources (for example, lifecycle configuration and website configuration)—are private. Only the resource owner, the AWS account that created it, can access the resource. The resource owner can optionally grant access permissions to others by writing an access policy.

Amazon S3 offers access policy options broadly categorized as resource-based policies and user policies. Access policies that you attach to your resources (buckets and objects) are referred to as resource-based policies. For example, bucket policies and ACLs are resource-based policies. You can also attach access policies to users in your account. These are called user policies. You can choose to use resource-based policies, user policies, or some combination of both to manage permissions to your Amazon S3 resources. The following sections provide general guidelines for managing permissions.

Using Bucket Policies and User Policies

Bucket policy and user policy are two of the access policy options available for you to grant permissions to your Amazon S3 resources. Both use a JSON-based access policy language, as do all AWS services that use policies.

A bucket policy is attached only to an Amazon S3 bucket, and it specifies which actions are allowed or denied for which principals on the bucket to which the policy is attached (for instance, allow user Alice to PUT but not DELETE objects in the bucket).

A user policy is attached to users in your account to allow or deny actions on your AWS resources. For example, you may choose to grant an IAM user in your account access to one of your buckets and allow the user to add, update, and delete objects. You can grant them that access with a user policy.

Now we will discuss the differences between IAM policies and Amazon S3 bucket policies. Both are used for access control, and they are both written in JSON using the AWS access policy language. However, unlike Amazon S3 bucket policies, IAM policies specify what actions are allowed or denied on what AWS resources (for example, allow ec2:TerminateInstances on the Amazon EC2 instance with instance_id=i-8b3620ec). You attach IAM policies to IAM users, groups, or roles, which are then subject to the permissions that you have defined. Instead of being attached to users, groups, or roles, bucket policies are attached to a specific resource, such as an Amazon S3 bucket.
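
One way to build and apply such a bucket policy programmatically is with the policy builder classes in the AWS SDK for Java. The sketch below grants a specific IAM user PUT (but not DELETE) access; the account ID, user name, and bucket name are placeholders:

import com.amazonaws.auth.policy.Policy;
import com.amazonaws.auth.policy.Principal;
import com.amazonaws.auth.policy.Resource;
import com.amazonaws.auth.policy.Statement;
import com.amazonaws.auth.policy.Statement.Effect;
import com.amazonaws.auth.policy.actions.S3Actions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Allow the IAM user Alice to PUT objects into examplebucket (no DELETE granted).
Policy bucketPolicy = new Policy().withStatements(
        new Statement(Effect.Allow)
                .withPrincipals(new Principal("AWS", "arn:aws:iam::111122223333:user/Alice"))
                .withActions(S3Actions.PutObject)
                .withResources(new Resource("arn:aws:s3:::examplebucket/*")));

s3.setBucketPolicy("examplebucket", bucketPolicy.toJson());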

Managing Access with Access Control Lists

Access control lists (ACLs) are resource-based access policies that you can use to manage access to your buckets and objects, including granting basic read/write permissions to other accounts.

There are limits to managing permissions using ACLs. For example, you can grant permissions only to other accounts; you cannot grant permissions to users in your account. You cannot grant conditional permissions, nor can you explicitly deny permissions using ACLs.

ACLs are suitable only for specific scenarios. For example, if a bucket owner allows other accounts to upload objects to the bucket, permissions on those objects can be managed only through an object ACL by the account that owns the object.

Symbol of Tip You can only grant access to other accounts using ACLs—not users in your own account.

Defense in Depth—Amazon S3 Security

Amazon S3 provides comprehensive security and compliance capabilities that meet the most stringent regulatory requirements, and it gives you flexibility in the way that you manage data for cost optimization, access control, and compliance. With this flexibility, however, comes the responsibility of ensuring that your content is secure.

You can use an approach known as defense in depth in Amazon S3 to secure your data. This approach uses multiple layers of security so that your data remains protected even if one of the layers fails.

Figure 3.14 represents defense in depth visually. It contains several Amazon S3 objects (A) in a single Amazon S3 bucket (B). You can encrypt these objects on the server side or the client side, and you can also configure the bucket policy such that objects are accessible only through Amazon CloudFront, which you can accomplish through an origin access identity (C). You can then configure Amazon CloudFront to deliver content only over HTTPS in addition to using your own domain name (D).

The figure shows defense in depth on Amazon S3.

Figure 3.14 Defense in depth on Amazon S3

To meet defense in depth requirements on Amazon S3:

  • Data must be encrypted at rest and during transit.
  • Data must be accessible only by a limited set of public IP addresses.
  • Data must not be publicly accessible directly from an Amazon S3 URL.
  • A domain name is required to consume the content.

You can apply policies to Amazon S3 buckets so that only users with appropriate permissions are allowed to access the buckets. Anonymous users (with public-read/public-read-write permissions) and authenticated users without the appropriate permissions are prevented from accessing the buckets.

You can also secure access to objects in Amazon S3 buckets. The objects in Amazon S3 buckets can be encrypted at rest and during transit to provide end-to-end security from the source (in this case, Amazon S3) to your users.

Query String Authentication

You can provide authentication information using query string parameters. Using query parameters to authenticate requests is useful when expressing a request entirely in a URL. This method is also referred to as presigning a URL.

With presigned URLs, you can grant temporary access to your Amazon S3 resources. For example, you can embed a presigned URL on your website, or alternatively use it in a command line client (such as Curl), to download objects.

The following is an example presigned URL:

https://s3.amazonaws.com/examplebucket/test.txt

?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=<your-access-key-id>/20130721/us-east-1/s3/aws4_request
&X-Amz-Date=20130721T201207Z
&X-Amz-Expires=86400
&X-Amz-SignedHeaders=host
&X-Amz-Signature=<signature-value>

In the example URL, note the following:

  • The line feeds are added for readability.
  • The X-Amz-Credential value in the URL shows the / character only for readability. In practice, it should be encoded as %2F.

Hosting a Static Website

If your website contains static content and optionally client-side scripts, then you can host your static website directly in Amazon S3 without the use of web-hosting servers.

To host a static website, you configure an Amazon S3 bucket for website hosting and upload your website content to the bucket. The website is then available at the AWS Region–specific website endpoint of the bucket in one of the following formats:

<bucket-name>.s3-website-<AWS-region>.amazonaws.com
<bucket-name>.s3-website.<AWS-region>.amazonaws.com

Instead of accessing the website by using the Amazon S3 website endpoint, you can use your own domain (for instance, example.com) to serve your content. The following steps allow you to configure your own domain:

  1. Register your domain with the registrar of your choice. You can use Amazon Route 53 to register your domain name or any other third-party domain registrar.
  2. Create your bucket in Amazon S3 and upload your static website content.
  3. Point your domain to your Amazon S3 bucket using either of the following as your DNS provider:
    • Amazon Route 53
    • Your third-party domain name registrar

Amazon S3 does not support server-side scripting or dynamic content. We discuss other AWS options for that throughout this study guide.

Symbol of Tip Static websites can be hosted in Amazon S3.
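
If you prefer to configure website hosting programmatically, a sketch with the AWS SDK for Java looks like the following; the bucket name and document names are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketWebsiteConfiguration;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Serve index.html as the index document and error.html for errors.
s3.setBucketWebsiteConfiguration("example-bucket",
        new BucketWebsiteConfiguration("index.html", "error.html"));

// The site is then available at the bucket's region-specific website endpoint.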

MFA Delete

MFA Delete is another way to control deletes of your objects in Amazon S3. It adds another layer of protection against unintentional or malicious deletes by requiring two forms of authentication for a delete request: your security credentials and a unique code from an approved authentication device (virtual or hardware MFA). Figure 3.15 shows what is required for a user to execute a delete operation on an object when MFA Delete is enabled.

The figure shows what is required for a user to execute a delete operation on an object when MFA Delete is enabled. A rectangular box labeled "Requires two forms of authentication" contains two icons. The icon on the left is labeled "Your security credentials (what you know)," and the icon on the right is labeled "Unique code from an approved authentication device (what you have)."

Figure 3.15 MFA Delete
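
A sketch of enabling MFA Delete and then deleting a specific object version with the AWS SDK for Java follows. The bucket name, MFA device serial number, token, and version ID are placeholders, and the versioning request must be made by the bucket owner using that owner's MFA device:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketVersioningConfiguration;
import com.amazonaws.services.s3.model.DeleteVersionRequest;
import com.amazonaws.services.s3.model.MultiFactorAuthentication;
import com.amazonaws.services.s3.model.SetBucketVersioningConfigurationRequest;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Enable versioning with MFA Delete on the bucket.
MultiFactorAuthentication mfa = new MultiFactorAuthentication(
        "arn:aws:iam::111122223333:mfa/root-account-mfa-device", "123456");
s3.setBucketVersioningConfiguration(new SetBucketVersioningConfigurationRequest(
        "examplebucket",
        new BucketVersioningConfiguration(BucketVersioningConfiguration.ENABLED)
                .withMfaDeleteEnabled(Boolean.TRUE),
        mfa));

// Deleting a specific object version now requires a current MFA code.
s3.deleteVersion(new DeleteVersionRequest("examplebucket", "important-doc.txt",
        "example-version-id", mfa));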

Cross-Region Replication

Cross-region replication (CRR) is a bucket-level configuration that enables automatic, asynchronous copying of objects across buckets in different AWS Regions. We refer to these buckets as the source bucket and destination bucket. These buckets can be owned by different accounts.

To activate this feature, add a replication configuration to your source bucket to direct Amazon S3 to replicate objects according to the configuration. In the replication configuration, provide information including the following:

  • The destination bucket
  • The objects that need to be replicated
  • Optionally, the destination storage class (otherwise the source storage class will be used)

The replicas that are created in the destination bucket will have these same characteristics as the source objects:

  • Key names
  • Metadata
  • Storage class (unless otherwise specified)
  • Object ACL

All data is encrypted in transit across AWS Regions using SSL.

You can replicate objects from a source bucket to only one destination bucket. After Amazon S3 replicates an object, the object cannot be replicated again. For example, even after you change the destination bucket in an existing replication configuration, Amazon S3 will not replicate it again.

Symbol of Tip After Amazon S3 replicates an object using CRR, the object cannot be replicated again (such as to another destination bucket).

Requirements for CRR include the following:

  • Versioning is enabled for both the source and destination buckets.
  • Source and destination buckets must be in different AWS Regions.
  • Amazon S3 must be granted appropriate permissions to replicate files.
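
A sketch of adding a replication configuration with the AWS SDK for Java follows; the bucket names, IAM role ARN, rule ID, and prefix are placeholders, and versioning must already be enabled on both buckets:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketReplicationConfiguration;
import com.amazonaws.services.s3.model.ReplicationDestinationConfig;
import com.amazonaws.services.s3.model.ReplicationRule;
import com.amazonaws.services.s3.model.ReplicationRuleStatus;
import com.amazonaws.services.s3.model.StorageClass;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// The IAM role grants Amazon S3 permission to replicate objects on your behalf.
BucketReplicationConfiguration replicationConfig = new BucketReplicationConfiguration()
        .withRoleARN("arn:aws:iam::111122223333:role/replication-role")
        .addRule("replicate-logs", new ReplicationRule()
                .withPrefix("logs/")                 // replicate only this prefix
                .withStatus(ReplicationRuleStatus.Enabled)
                .withDestinationConfig(new ReplicationDestinationConfig()
                        .withBucketARN("arn:aws:s3:::destination-bucket")
                        .withStorageClass(StorageClass.StandardInfrequentAccess)));

s3.setBucketReplicationConfiguration("source-bucket", replicationConfig);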

VPC Endpoints

A virtual private cloud (VPC) endpoint enables you to connect your VPC privately to Amazon S3 without requiring an internet gateway, network address translation (NAT) device, virtual private network (VPN) connection, or AWS Direct Connect connection. Instances in your VPC do not require public IP addresses to communicate with the resources in the service. Traffic between your VPC and Amazon S3 does not leave the Amazon network.

Amazon S3 uses a gateway type of VPC endpoint. The gateway is a target for a specified route in your route table, used for traffic destined for a supported AWS service. These endpoints are easy to configure, are highly reliable, and provide a secure connection to Amazon S3 that does not require a gateway or NAT instance.

Amazon EC2 instances running in private subnets of a VPC can have controlled access to Amazon S3 buckets, objects, and API functions that are in the same region as the VPC. You can use an Amazon S3 bucket policy to indicate which VPCs and which VPC endpoints have access to your Amazon S3 buckets.

Using the AWS SDKs, AWS CLI, and AWS Explorers

You can use the AWS SDKs when developing applications with Amazon S3. The AWS SDKs simplify your programming tasks by wrapping the underlying REST API. The AWS Mobile SDKs and the AWS Amplify JavaScript library are also available for building connected mobile and web applications using AWS. In addition to AWS SDKs, AWS explorers are available for Visual Studio and Eclipse for Java Integrated Development Environment (IDE). In this case, the SDKs and AWS explorers are available bundled together as AWS Toolkits. You can also use the AWS CLI to manage Amazon S3 buckets and objects.

AWS has deprecated SOAP support over HTTP, but it is still available over HTTPS. New Amazon S3 features will not be supported over SOAP. We recommend that you use either the REST API or the AWS SDKs for any new development and migrate any existing SOAP calls when you are able.

Making Requests

Every interaction with Amazon S3 is either authenticated or anonymous. Authentication is the process of verifying the identity of the requester trying to access an AWS product (you are who you say you are, and you are allowed to do what you are asking to do). Authenticated requests must include a signature value that authenticates the request sender, generated in part from the requester’s AWS access keys. If you are using the AWS SDK, the libraries compute the signature from the keys that you provide. If you make direct REST API calls in your application, however, you must write code to compute the signature and add it to the request.

Stateless and Serverless Applications

Amazon S3 provides developers with secure, durable, and highly scalable object storage that can be used to decouple storage for use in serverless applications. Developers can also use Amazon S3 for storing and sharing state in stateless applications.

Developers on AWS are regularly moving shared file storage to Amazon S3 for stateless applications. This is a common method for decoupling your compute and storage and increasing the ability to scale your application by decoupling that storage. We will discuss stateless and serverless applications throughout this study guide.

Data Lake

Traditional data storage can no longer provide the agility and flexibility required to handle the volume, velocity, and variety of data used by today’s applications. Because of this, many organizations are shifting to a data lake architecture.

A data lake is an architectural approach that allows you to store massive amounts of data in a central location for consumption by multiple applications. Because data can be stored as is, there is no need to convert it to a predefined schema, and you no longer need to know what questions to ask of your data beforehand.

Amazon S3 is a common component of a data lake solution on the cloud, and it can complement your other storage solutions. If you move to a data lake, you are essentially separating compute and storage, meaning that you are going to build and scale your storage and compute separately. You can take storage that you currently have on premises or in your data center and instead use Amazon S3, which then allows you to scale and build your compute in any desired configuration, regardless of your storage.

That design pattern is different from most applications available today, where the storage is tied to the compute. When you separate those two features and instead use a data lake, you achieve an agility that allows you to invent new types of applications while you are managing your storage as an independent entity.

In addition, Amazon S3 lets you grow and scale in a virtually unlimited fashion. You do not have to take specific actions to expand your storage capacity—it grows automatically with your data.

In the data lake diagram shown in Figure 3.16, you will see how to use Amazon S3 as a highly available and durable central storage repository. From there, a virtually unlimited number of services and applications, both on premises and in the cloud, can take advantage of using Amazon S3 as a data lake.

Illustration shows the data lake diagram.

Figure 3.16 Data lakes

Customers often set up a data lake as part of their migration to the cloud so that they can access their data from new applications on the cloud, migrated applications to the cloud, and applications that have not yet been migrated to the cloud.

Performance

There are a number of actions that Amazon S3 takes by default to help you achieve high levels of performance. Amazon S3 automatically scales to thousands of requests per second per prefix based on your steady state traffic. Amazon S3 will automatically partition your prefixes within hours, adjusting to increases in request rates.

Consideration for Workloads

To optimize Amazon S3 for mixed or GET-intensive workloads, you must become familiar with best practices for performance optimization.

Mixed request types If your requests are typically a mix of GET, PUT, DELETE, and GET Bucket (list objects), choosing appropriate key names for your objects ensures better performance by providing low-latency access to the Amazon S3 index.

GET-intensive workloads If the bulk of your workload consists of GET requests, you may want to use Amazon CloudFront, a content delivery service (discussed later in this chapter).

Tips for Object Key Naming

The way that you name your keys in Amazon S3 can affect the data access patterns, which may directly impact the performance of your application.

It is a best practice at AWS to design for performance from the start. Even though you may be developing a new application, that application is likely to grow over time. If you anticipate your application growing to more than approximately 1,000 requests per second (including both PUTs and GETs on your object), you will want to consider using a three- or four-character hash in your key names.

If you anticipate your application receiving fewer than 1,000 requests per second and you don’t see a lot of traffic in your storage, then you do not need to implement this best practice. Your application will still benefit from Amazon S3’s default performance.

Symbol of Note In the past, customers would also add entropy in their key names. Because of recent Amazon S3 performance enhancements, most customers no longer need to worry about introducing entropy in key names.

Example 1: Random Hash

examplebucket/232a-2017-26-05-15-00-00/cust1234234/photo1.jpg
examplebucket/7b54-2017-26-05-15-00-00/cust3857422/photo2.jpg
examplebucket/921c-2017-26-05-15-00-00/cust1248473/photo2.jpg

Symbol of Tip A random hash should come before patterns, such as dates and sequential IDs.

Using a naming hash can improve the performance of heavy-traffic applications. Object keys are stored in an index in all regions. If you’re constantly writing the same key prefix over and over again (for example, a key with the current year), all of your objects will be close to each other within the same partition in the index. When your application experiences an increase in traffic, it will be trying to read from the same section of the index, resulting in decreased performance as Amazon S3 tries to spread out your data to achieve higher levels of throughput.

Symbol of Tip Always first ensure that your application can accommodate a naming hash.

By putting the hash at the beginning of your key name, you are adding randomness. You could hash the key name and place it at the beginning of your object right after the bucket name. This will ensure that your data will be spread across different partitions and allow you to grow to a higher level of throughput without experiencing a re-indexing slowdown if you go above peak traffic volumes.

Example 2: Naming Hash

examplebucket/animations/232a-2017-26-05-15-00/cust1234234/animation1.obj
examplebucket/videos/ba65-2017-26-05-15-00/cust8474937/video2.mpg
examplebucket/photos/8761-2017-26-05-15-00/cust1248473/photo3.jpg

In this second example, imagine that you are storing a lot of animations, videos, and photos in Amazon S3. If you know that you are going to have a lot of traffic to those individual prefixes, you can add your hash after the prefix. That allows you to write prefixes into your lifecycle policies or perform list API calls against a particular prefix. You are still getting the performance benefit by adding the hash to your key name, but now you can also use the prefix as necessary.

This example allows you to balance the need to list your objects and organize them against the need to spread your data across different partitions for performance.

Amazon S3 Transfer Acceleration

Amazon S3 Transfer Acceleration is a feature that optimizes throughput when transferring larger objects across larger geographic distances. Amazon S3 Transfer Acceleration uses Amazon CloudFront edge locations to assist you in uploading your objects more quickly in cases where you are closer to an edge location than to the region to which you are transferring your files.

Instead of using the public internet to upload objects from Southeast Asia, across the globe to Northern Virginia, take advantage of the global Amazon content delivery network (CDN). AWS has edge locations around the world, and you upload your data to the edge location closest to your location. This way, you are traveling across the AWS network backbone to your destination region, instead of across the public internet. This option might give you a significant performance improvement and better network consistency than the public internet.

To implement Amazon S3 Transfer Acceleration, you do not need to make any changes to your application. It is enabled by performing the following steps:

  1. Enable Transfer Acceleration on a bucket that conforms to DNS naming requirements and does not contain periods (.).
  2. Transfer data to and from the acceleration-enabled bucket by using one of the s3-accelerate endpoint domain names.
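
A sketch of both steps with the AWS SDK for Java follows; the bucket name, region, and object are placeholders:

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketAccelerateConfiguration;
import com.amazonaws.services.s3.model.BucketAccelerateStatus;
import com.amazonaws.services.s3.model.SetBucketAccelerateConfigurationRequest;

// Step 1: enable Transfer Acceleration on the bucket.
AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
s3.setBucketAccelerateConfiguration(new SetBucketAccelerateConfigurationRequest(
        "acceleration-enabled-bucket",
        new BucketAccelerateConfiguration(BucketAccelerateStatus.Enabled)));

// Step 2: create a client that uses the s3-accelerate endpoints for transfers.
AmazonS3 acceleratedClient = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)
        .withAccelerateModeEnabled(true)
        .build();
acceleratedClient.putObject("acceleration-enabled-bucket", "large-video.mp4",
        new java.io.File("large-video.mp4"));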

There is a small fee for using Transfer Acceleration. If your speed using Transfer Acceleration is no faster than it would have been going over the public internet, however, there is no additional charge.

The further you are from a particular region, the more benefit you will derive from transferring your files more quickly by uploading to a closer edge location. Figure 3.17 shows how accessing an edge location can reduce the latency for your users, as opposed to accessing content from a region that is farther away.

A world map shows how accessing an edge location can reduce the latency for your users, as opposed to accessing content from a region that is farther away.

Figure 3.17 Using an AWS edge location

Multipart Uploads

When uploading a large object to Amazon S3 in a single-threaded manner, it can take a significant amount of time to complete. The multipart upload API enables you to upload large objects in parts to speed up your upload by doing so in parallel.

To use multipart upload, you first break the object into smaller parts and upload those parts in parallel. You then send a completion request that lists all of the uploaded parts, and Amazon S3 assembles those individual pieces into a single Amazon S3 object.

Multipart upload can be used for objects ranging from 5 MB to 5 TB in size.
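
You can call the multipart upload API operations directly, but the high-level TransferManager in the AWS SDK for Java handles the splitting, parallel part uploads, and completion for you. A minimal sketch (bucket name, key, and file path are placeholders):

import java.io.File;

import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(AmazonS3ClientBuilder.defaultClient())
        .build();

// TransferManager automatically switches to multipart upload for large files
// and uploads the parts in parallel.
Upload upload = tm.upload("examplebucket", "backups/large-backup.tar",
        new File("/backups/large-backup.tar"));
upload.waitForCompletion();   // blocks until all parts are uploaded and assembled

tm.shutdownNow();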

Range GETs

Range GETs are similar to multipart uploads, but in reverse. If you are downloading a large object and tracking the offsets, use range GETs to download the object as multiple parts instead of a single part. You can then download those parts in parallel and potentially see an improvement in performance.
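
A sketch of a single ranged download with the AWS SDK for Java follows; the byte range, bucket, and key are placeholders, and in practice you would issue several such requests in parallel, each for a different range:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Download only the first 10 MB of the object (bytes 0 through 10,485,759).
S3Object firstPart = s3.getObject(new GetObjectRequest("examplebucket", "backups/large-backup.tar")
        .withRange(0, 10_485_759));

// Read firstPart.getObjectContent() and combine it with the other ranges you download.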

Amazon CloudFront

Using a CDN like Amazon CloudFront, you may achieve lower latency and higher-throughput performance. You also will not experience as many requests to Amazon S3 because your content will be cached at the edge location. Your users will also experience the performance improvement of having cached storage through Amazon CloudFront versus going back to Amazon S3 for each new GET on an object.

TCP Window Scaling

Transmission Control Protocol (TCP) window scaling allows you to improve network throughput between your client (operating system and application layer) and Amazon S3 by supporting TCP window sizes larger than 64 KB. Although it can improve performance, it can be challenging to set up correctly, so refer to the AWS Documentation repository for details.

TCP Selective Acknowledgment

TCP selective acknowledgment is designed to improve recovery time after a large number of packet losses. It is supported by most newer operating systems, but it might have to be enabled. Refer to the Amazon S3 Developer Guide for more information.

Pricing

With Amazon S3, you pay only for what you use. There is no minimum fee, and there is no charge for data transfer into Amazon S3.

You pay for the following:

  • The storage that you use
  • The API calls that you make (PUT, COPY, POST, LIST, GET)
  • Data transfer out of Amazon S3

Data transfer out pricing is tiered, so the more you use, the lower your cost per gigabyte. Refer to the AWS website for the latest pricing.

Symbol of Tip Amazon S3 pricing differs from the pricing of Amazon EBS volumes in that if you create an Amazon EBS volume and store nothing on it, you are still paying for the storage space of the volume that you have allocated. With Amazon S3, you pay for the storage space that is being used—not allocated.

Object Lifecycle Management

To manage your objects so that they are stored cost effectively throughout their lifecycle, use a lifecycle configuration. A lifecycle configuration is a set of rules that defines actions that Amazon S3 applies to a group of objects.

There are two types of actions:

Transition actions Transition actions define when objects transition to another storage class. For example, you might choose to transition objects to the STANDARD_IA storage class 30 days after you created them or archive objects to the GLACIER storage class one year after creating them.

Expiration actions Expiration actions define when objects expire. Amazon S3 deletes expired objects on your behalf.

When Should You Use Lifecycle Configuration?

You should use lifecycle configuration rules for objects that have a well-defined lifecycle. The following are some examples:

  • If you upload periodic logs to a bucket, your application might need them for a week or a month. After that, you may delete them.
  • Some documents are frequently accessed for a limited period of time. After that, they are infrequently accessed. At some point, you might not need real-time access to them, but your organization or regulations might require you to archive them for a specific period. After that, you may delete them.
  • You can upload some data to Amazon S3 primarily for archival purposes: for example, digital media archives, financial and healthcare records, raw genomics sequence data, long-term database backups, and data that must be retained for regulatory compliance.

With lifecycle configuration rules, you can tell Amazon S3 to transition objects to less expensive storage classes or archive or delete them.

Configuring a Lifecycle

A lifecycle configuration (an XML file) comprises a set of rules with predefined actions that you need Amazon S3 to perform on objects during their lifetime. Amazon S3 provides a set of API operations for managing lifecycle configuration on a bucket, and it is stored by Amazon S3 as a lifecycle subresource that is attached to your bucket.

You can also configure the lifecycle by using the Amazon S3 console, the AWS SDKs, or the REST API.

The following lifecycle configuration specifies a rule that applies to objects with key name prefix logs/. The rule specifies the following actions:

  • Two transition actions
    • Transition objects to the STANDARD_IA storage class 30 days after creation
    • Transition objects to the GLACIER storage class 90 days after creation
  • One expiration action that directs Amazon S3 to delete objects a year after creation

<LifecycleConfiguration>
  <Rule>
    <ID>example-id</ID>
    <Filter>
       <Prefix>logs/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
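
The same rule can be applied programmatically. A sketch with the AWS SDK for Java (the bucket name is a placeholder):

import java.util.Arrays;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Transition;
import com.amazonaws.services.s3.model.StorageClass;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Mirror the XML rule above: logs/ objects move to STANDARD_IA after 30 days,
// to GLACIER after 90 days, and expire after 365 days.
BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
        .withId("example-id")
        .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate("logs/")))
        .addTransition(new Transition().withDays(30)
                .withStorageClass(StorageClass.StandardInfrequentAccess))
        .addTransition(new Transition().withDays(90)
                .withStorageClass(StorageClass.Glacier))
        .withExpirationInDays(365)
        .withStatus(BucketLifecycleConfiguration.ENABLED);

s3.setBucketLifecycleConfiguration("examplebucket",
        new BucketLifecycleConfiguration().withRules(Arrays.asList(rule)));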

Figure 3.18 shows a set of Amazon S3 lifecycle policies in place. These policies move files automatically from one storage class to another as they age out at certain points in time.

The figure shows a set of Amazon S3 lifecycle policies in place.

Figure 3.18 Amazon S3 lifecycle policies

AWS File Storage Services

AWS offers Amazon Elastic File System (Amazon EFS) for file storage to enable you to share access to files that reside on the cloud.

Amazon Elastic File System

Amazon Elastic File System (Amazon EFS) provides scalable file storage and a standard file system interface for use with Amazon EC2. You can create an Amazon EFS file system, configure your instances to mount the file system, and then use the Amazon EFS file system as a common data source for workloads and applications running on multiple instances.

Amazon EFS can be mounted to multiple Amazon EC2 instances simultaneously, where it can continue to expand up to petabytes while providing low latency and high throughput.

Consider using Amazon EFS instead of Amazon S3 or Amazon EBS if you have an application (Amazon EC2 or on premises) or a use case that requires a file system and any of the following:

  • Multi-attach
  • GB/s throughput
  • Multi-AZ availability/durability
  • Automatic scaling (growing/shrinking of storage)

Customers use Amazon EFS for the following use cases today:

  • Web serving
  • Database backups
  • Container storage
  • Home directories
  • Content management
  • Analytics
  • Media and entertainment workflows
  • Workflow management
  • Shared state management

Symbol of Warning Amazon EFS is not supported on Windows instances.

Creating your Amazon EFS File System

File System

The Amazon EFS file system is the primary resource in Amazon EFS, and it is where you store your files and directories. You can create up to 125 file systems per account.

Mount Target

To access your file system from within a VPC, create mount targets in the VPC. A mount target is a Network File System (NFS) endpoint within your VPC that includes an IP address and a DNS name, both of which you use in your mount command. A mount target is highly available, and it is illustrated in Figure 3.19.

The figure illustrates a mount target.

Figure 3.19 Mount target

Accessing an Amazon EFS File System

There are several different ways that you can access an Amazon EFS file system, including using Amazon EC2 and AWS Direct Connect.

Using Amazon Elastic Compute Cloud

To access a file system from an Amazon Elastic Compute Cloud (Amazon EC2) instance, you must mount the file system by using the standard Linux mount command, as shown in Figure 3.20. The file system will then appear as a local set of directories and files. An NFS v4.1 client is standard on Amazon Linux AMI distributions.

The figure shows a screenshot illustrating how to mount the file system through the standard Linux mount command.

Figure 3.20 Mounting the file system

In your command, specify the file system type (nfs4), the version (4.1), the file system DNS name or IP address, and the user’s target directory.

A file system belongs to a region, and your Amazon EFS file system spans all Availability Zones in that region. Once you have mounted your file system, data can be accessed from any Availability Zone in the region within your VPC while maintaining full consistency. Figure 3.21 shows how you communicate with Amazon EC2 instances within a VPC.

The figure shows how to communicate with Amazon EC2 instances within a VPC.

Figure 3.21 Using Amazon EFS

Using AWS Direct Connect

You can also mount your on-premises servers to Amazon EFS in your Amazon VPC using AWS Direct Connect. With AWS Direct Connect, you can mount your on-premises servers to Amazon EFS using the same mount command used to mount in Amazon EC2. Figure 3.22 shows how to use AWS Direct Connect with Amazon EFS.

The figure shows how to use AWS Direct Connect with Amazon EFS.

Figure 3.22 Using AWS Direct Connect with Amazon EFS

Customers can use Amazon EFS combined with AWS Direct Connect for migration, bursting, or backup and disaster recovery.

Syncing Files Using AWS DataSync

Now that you have a functioning Amazon EFS file system, you can use AWS DataSync to synchronize files from an existing file system to Amazon EFS. AWS DataSync can synchronize your file data and also file system metadata such as ownership, time stamps, and access permissions.

To do this, download and deploy a sync agent from the Amazon EFS console as either a virtual machine (VM) image or an AMI.

Next, create a sync task and configure your source and destination file systems. Then start your task to begin syncing the files, and monitor the progress of the file sync using Amazon CloudWatch.

Performance

Amazon EFS is designed for a wide spectrum of performance needs, including the following:

  • High throughput and parallel I/O
  • Low latency and serial I/O

To support those two sets of workloads, Amazon EFS offers two different performance modes, as described here:

General purpose (default) General-purpose mode is the default mode, and it is used for latency-sensitive applications and general-purpose workloads, offering the lowest latencies for file operations. While there is a trade-off of limiting operations to 7,000 per second, general-purpose mode is the best choice for most workloads.

Max I/O If you are running large-scale and data-heavy applications, then choose the max I/O performance option, which provides you with a virtually unlimited ability to scale out throughput and IOPS, but with a trade-off of slightly higher latencies. Use max I/O when you have 10 or more instances accessing your file system concurrently, as shown in Table 3.6.

Table 3.6 I/O Performance Options

Mode What’s It For? Advantages Trade-Offs When to Use
General purpose (default) Latency-sensitive applications and general-purpose workloads Lowest latencies for file operations Limit of 7,000 ops/sec Best choice for most workloads
Max I/O Large-scale and data-heavy applications Virtually unlimited ability to scale out throughput/ IOPS Slightly higher latencies Consider if 10 (or more) instances are accessing your file system concurrently

If you are not sure which mode is best for your usage pattern, use the PercentIOLimit Amazon CloudWatch metric to determine whether you are constrained by general-purpose mode. If you are regularly hitting the 7,000 file operations per second limit of general-purpose mode, then you will likely benefit from max I/O performance mode.
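
If you want to check this metric programmatically rather than in the console, the following is a minimal sketch using the AWS SDK for Java. The file system ID is a placeholder, and the client assumes your default credentials and region configuration.

    import java.util.Date;

    import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
    import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
    import com.amazonaws.services.cloudwatch.model.Datapoint;
    import com.amazonaws.services.cloudwatch.model.Dimension;
    import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;

    public class CheckPercentIOLimit {

        public static void main(String[] args) {
            // Placeholder file system ID; replace with your own.
            String fileSystemId = "fs-12345678";

            AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

            // Retrieve the maximum PercentIOLimit over the last 24 hours in
            // one-hour periods. Values near 100 mean the file system is close to
            // the general-purpose limit and may benefit from max I/O mode.
            GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
                    .withNamespace("AWS/EFS")
                    .withMetricName("PercentIOLimit")
                    .withDimensions(new Dimension()
                            .withName("FileSystemId")
                            .withValue(fileSystemId))
                    .withStartTime(new Date(System.currentTimeMillis() - 24 * 60 * 60 * 1000L))
                    .withEndTime(new Date())
                    .withPeriod(3600)
                    .withStatistics("Maximum");

            for (Datapoint dp : cloudWatch.getMetricStatistics(request).getDatapoints()) {
                System.out.println(dp.getTimestamp() + " max PercentIOLimit: " + dp.getMaximum());
            }
        }
    }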

As discussed with the CAP theorem earlier in this study guide, there are differences in both performance and trade-off decisions when you’re designing systems that use Amazon EFS and Amazon EBS. The distributed architecture of Amazon EFS results in a small increase in latency for each operation, as the data that you are storing gets pushed across multiple servers in multiple Availability Zones. Amazon EBS can provide lower latency than Amazon EFS, but at the cost of some durability. With Amazon EBS, you provision the size of the device, and if you reach its maximum limit, you must increase its size or add more volumes, whereas Amazon EFS scales automatically. Table 3.7 shows the various performance and other characteristics for Amazon EFS as related to Amazon EBS Provisioned IOPS.

Table 3.7 Amazon EBS Performance Relative to Amazon EFS

    Amazon EFS Amazon EBS Provisioned IOPS
Performance Per-operation latency Low, consistent Lowest, consistent
Throughput scale Multiple GBs per second Single GB per second
Characteristics Data availability/ durability Stored redundantly across multiple Availability Zones Stored redundantly in a single Availability Zone
Access 1 to 1000s of EC2 instances, from multiple Availability Zones, concurrently Single Amazon EC2 instance in a single Availability Zone
Use cases Big Data and analytics, media processing workflows, content management, web serving, home directories Boot volumes, transactional and NoSQL databases, data warehousing, ETL

Security

You can implement security in multiple layers with Amazon EFS by controlling the following:

  • Network traffic to and from file systems (mount targets) using the following:
    • VPC security groups
    • Network ACLs
  • File and directory access by using POSIX permissions
  • Administrative access (API access) to file systems by using IAM. Amazon EFS supports:
    • Action-level permissions
    • Resource-level permissions

Symbol of Tip Familiarize yourself with the Amazon EFS product, product details, and FAQ pages. Some exam questions may be answered by content from those pages.

Storage Comparisons

This section provides valuable charts that can serve as a quick reference if you are tasked with choosing a storage system for a particular project or application.

Use Case Comparison

Table 3.8 will help you understand the main properties and use cases for each of the cloud storage products on AWS.

Table 3.8 AWS Cloud Storage Products

If You Need: Consider Using:
Persistent local storage for Amazon EC2, relational and NoSQL databases, data warehousing, enterprise applications, big data processing, or backup and recovery Amazon EBS
A file system interface and file system access semantics to make data available to one or more Amazon EC2 instances for content serving, enterprise applications, media processing workflows, big data storage, or backup and recovery Amazon EFS
A scalable, durable platform to make data accessible from any internet location for user-generated content, active archive, serverless computing, Big Data storage, or backup and recovery Amazon S3
Highly affordable, long-term storage that can replace tape for archive and regulatory compliance Amazon S3 Glacier
A hybrid storage cloud augmenting your on-premises environment with AWS cloud storage for bursting, tiering, or migration AWS Storage Gateway
A portfolio of services to help simplify and accelerate moving data of all types and sizes into and out of the AWS Cloud AWS Cloud Data Migration Services

Storage Temperature Comparison

Table 3.9 shows a comparison of instance store, Amazon EBS, Amazon S3, and Amazon S3 Glacier.

Symbol of Tip Understanding Table 3.9 will help you make decisions about latency, size, durability, and cost during the exam.

Table 3.9 Storage Comparison

  Instance Store Amazon EBS Amazon S3 Amazon S3 Glacier
Average latency ms ms ms, sec, min (~ size) hrs
Data volume 4 GB to 48 TB 1 GiB to 16 TiB No limit No limit
Item size Block storage Block storage 5 TB max 40 TB max
Request rate Very high Very high Low to very high (no limit) Very low (no limit)
Cost/GB per month Amazon EC2 instance cost ¢¢ ¢ ¢/10
Durability Low High Very high Very high
Temperature Hot <——————————————————————————> Cold

Comparison of Amazon EBS and Instance Store

Before considering Amazon EC2 instance store as a storage option, make sure that your data does not meet any of these criteria:

  • Must persist through instance stops, terminations, or hardware failures
  • Needs to be encrypted at the full volume level
  • Needs to be backed up with Amazon EBS snapshots
  • Needs to be removed from instances and reattached to another

If your data meets any of the previous four criteria, use an Amazon EBS volume. Otherwise, compare instance store and Amazon EBS for storage.

Because instance store is directly attached to the host computer, it will have lower latency than an Amazon EBS volume attached to the Amazon EC2 instance. Instance store is provided at no additional cost beyond the price of the Amazon EC2 instance you choose (if the instance has instance store[s] available), whereas Amazon EBS volumes incur an additional cost.

Comparison of Amazon S3, Amazon EBS, and Amazon EFS

Table 3.10 is useful for comparing the performance and storage characteristics of Amazon's highest-performing file, object, and block cloud storage offerings. This comparison will also be helpful when choosing the right data store for the applications that you are developing. It is also important for the exam.

Table 3.10 Storage Service Comparison (EFS, S3, and EBS)

   

  File (Amazon EFS) Object (Amazon S3) Block (Amazon EBS)

Performance Per-operation latency Low, consistent Low, for mixed request types, and integration with CloudFront Low, consistent
Throughput scale Multiple GB per second Single GB per second
Characteristics Data Availability/ Durability Stored redundantly across multiple Availability Zones Stored redundantly in a single Availability Zone
Access One to thousands of Amazon EC2 instances or on-premises servers, from multiple Availability Zones, concurrently One to millions of connections over the web Single Amazon EC2 instance in a single Availability Zone
Use Cases Web serving and content management, enterprise applications, media and entertainment, home directories, database backups, developer tools, container storage, Big Data analytics Web serving and content management, media and entertainment, backups, Big Data analytics, data lake Boot volumes, transactional and NoSQL databases, data warehousing, ETL

Cloud Data Migration

Data is the cornerstone of successful cloud application deployments. Your evaluation and planning process may highlight the physical limitations inherent to migrating data from on-premises locations into the cloud. To assist you with that process, AWS offers a suite of tools to help you move data via networks, roads, and technology partners in and out of the cloud through offline, online, or streaming models.

The daunting realities of data transport apply to most projects: how to move your data to the cloud with minimal disruption, cost, and time, and which transfer method is the most efficient.

To determine the best-case scenario for efficiently moving your data, use this formula:

Number of Days = (Total Bytes)/(Megabits per second * 125 * 1000 * Network Utilization * 60 seconds * 60 minutes * 24 hours)

For example, if you have a T1 connection (1.544 Mbps) and 1 TB (1024 × 1024 × 1024 × 1024 bytes) to move in or out of AWS, the theoretical minimum time that it would take to load over your network connection at 80 percent network utilization is 82 days.
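
To make the arithmetic concrete, the following minimal sketch plugs the T1 example into the formula; the numbers are the same as in the example above.

    public class TransferTimeEstimate {

        public static void main(String[] args) {
            double totalBytes = Math.pow(1024, 4);   // 1 TB
            double megabitsPerSecond = 1.544;        // T1 connection
            double networkUtilization = 0.80;        // 80 percent utilization

            // Number of Days = Total Bytes / (Mbps * 125 * 1000 * Utilization * 60 * 60 * 24)
            double days = totalBytes
                    / (megabitsPerSecond * 125 * 1000 * networkUtilization * 60 * 60 * 24);

            System.out.printf("Estimated transfer time: %.1f days%n", days); // prints roughly 82.4
        }
    }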

Instead of using up bandwidth and taking a long time to migrate, many AWS customers are choosing one of the data migration options that are discussed next.

Symbol of Tip Multiple-choice questions that ask you to choose two or three true answers require that all of your answers be correct. There is no partial credit for getting a fraction correct. Pay extra attention to those questions when doing your review.

AWS Storage Gateway

AWS Storage Gateway is a hybrid cloud storage service that enables your on-premises applications to use AWS cloud storage seamlessly. You can use this service for the following:

  • Backup and archiving
  • Disaster recovery
  • Cloud bursting
  • Storage tiering
  • Migration

Your applications connect to the service through a gateway appliance using standard storage protocols, such as NFS and internet Small Computer System Interface (iSCSI). The gateway connects to AWS storage services, such as Amazon S3, Amazon S3 Glacier, and Amazon EBS, providing storage for files, volumes, and virtual tapes in AWS.

File Gateway

A file gateway supports a file interface into Amazon S3, and it combines a cloud service with a virtual software appliance that is deployed into your on-premises environment as a VM. You can think of file gateway as an NFS mount on Amazon S3, allowing you to access your data directly in Amazon S3 from on premises as a file share.

Volume Gateway

A volume gateway provides cloud-based storage volumes that you can mount as iSCSI devices from your on-premises application servers. A volume gateway supports cached mode and stored volume mode configurations.

Note that the volume gateway represents the family of gateways that support block-based volumes, previously referred to as gateway-cached volumes and gateway-stored volumes.

Cached Mode

In the cached volume mode, your data is stored in Amazon S3, and a cache of the frequently accessed data is maintained locally by the gateway. This enables you to achieve cost savings on primary storage and minimize the need to scale your storage on premises while retaining low-latency access to your most used data.

Stored Volume Mode

In the stored volume mode, data is stored on your local storage with volumes backed up asynchronously as Amazon EBS snapshots stored in Amazon S3. This provides durable off-site backups.

Tape Gateway

A tape gateway can be used for backup to migrate off of physical tapes and onto a cost-effective and durable archive backup such as Amazon S3 Glacier. For a tape gateway, you store and archive your data on virtual tapes in AWS. A tape gateway eliminates some of the challenges associated with owning and operating an on-premises physical tape infrastructure. It can also be used for migrating data off of tapes, which are nearing end of life, into a more durable type of storage that still acts like tape.

AWS Import/Export

AWS Import/Export accelerates moving large amounts of data into and out of the AWS Cloud using portable storage devices for transport. It transfers your data directly onto and off of storage devices using Amazon’s high-speed internal network and bypassing the internet.

For significant datasets, AWS Import/Export is often faster than internet transfer and more cost-effective than upgrading your connectivity. You load your data onto your devices and then create a job in the AWS Management Console to schedule shipping of your devices.

You are responsible for providing your own storage devices and the shipping charges to AWS.

It supports (in a limited number of regions) the following:

  • Importing and exporting of data in Amazon S3 buckets
  • Importing data into Amazon EBS snapshots

You cannot export directly from Amazon S3 Glacier. You must first restore your objects to Amazon S3 before exporting using AWS Import/Export.

AWS Snowball

AWS Snowball is a petabyte-scale data transport solution that uses physical storage appliances, bypassing the internet, to transfer large amounts of data into and out of Amazon S3.

AWS Snowball addresses common challenges with large-scale data transfers, including the following:

  • High network costs
  • Long transfer times
  • Security concerns

Figure 3.23 shows a physical AWS Snowball device.

The figure shows a physical AWS Snowball device.

Figure 3.23 AWS Snowball

When you transfer your data with AWS Snowball, you do not need to write any code or purchase any hardware. To transfer data using AWS Snowball, perform the following steps:

  1. Create a job in the AWS Management Console. The AWS Snowball appliance is shipped to you automatically.
  2. When the appliance arrives, attach it to your local network.
  3. Download and run the AWS Snowball client to establish a connection.
  4. Use the client to select the file directories that you need to transfer to the appliance. The client will then encrypt and transfer the files to the appliance at high speed.
  5. Once the transfer is complete and the appliance is ready to be returned, the E Ink shipping label automatically updates and you can track the job status via Amazon Simple Notification Service (Amazon SNS), text messages, or directly in the console.

Table 3.11 shows some common AWS Snowball use cases.

Table 3.11 AWS Snowball Use Cases

Use Case Description
Cloud migration If you have large quantities of data that you need to migrate into AWS, AWS Snowball is often much faster and more cost-effective than transferring that data over the internet.
Disaster recovery In the event that you need to retrieve a large quantity of data stored in Amazon S3 quickly, AWS Snowball appliances can help retrieve the data much quicker than high-speed internet.
Data center decommission There are many steps involved in decommissioning a data center to make sure that valuable data is not lost. Snowball can help ensure that your data is securely and cost-effectively transferred to AWS during this process.
Content distribution Use Snowball appliances if you regularly receive or need to share large amounts of data with clients, customers, or business associates. Snowball appliances can be sent directly from AWS to client or customer locations.

AWS Snowball Edge

AWS Snowball Edge is a 100-TB data transfer service with on-board storage and compute power for select AWS capabilities. In addition to transferring data to AWS, AWS Snowball Edge can undertake local processing and edge computing workloads. Figure 3.24 shows a physical AWS Snowball Edge device.

The figure shows a physical AWS Snowball Edge device.

Figure 3.24 AWS Snowball Edge

Features of AWS Snowball Edge include the following:

  • An endpoint on the device that is compatible with Amazon S3
  • A file interface with NFS support
  • A cluster mode where multiple AWS Snowball Edge devices can act as a single, scalable storage pool with increased durability
  • The ability to run AWS Lambda functions, powered by AWS IoT Greengrass, as data is copied to the device
  • Encryption taking place on the appliance itself

The transport of data is done by shipping the data in the appliances through a regional carrier. The appliance differs from the standard AWS Snowball because it can bring the power of the AWS Cloud to your local environment, with local storage and compute functionality.

There are three types of jobs that can be performed with Snowball Edge appliances:

  • Import jobs into Amazon S3
  • Export jobs from Amazon S3
  • Local compute and storage-only jobs

Use AWS Snowball Edge when you need the following:

  • Local storage and compute in an environment that might or might not have an internet connection
  • To transfer large amounts of data into and out of Amazon S3, bypassing the internet

Table 3.12 shows the different use cases for the different AWS Snowball devices.

Table 3.12 AWS Snowball Device Use Cases

Use Case Supported Devices
Import data into Amazon S3 AWS Snowball, AWS Snowball Edge
Copy data directly from HDFS AWS Snowball
Export from Amazon S3 AWS Snowball, AWS Snowball Edge
Durable local storage AWS Snowball Edge
Use in a cluster of devices AWS Snowball Edge
Use with AWS IoT Greengrass AWS Snowball Edge
Transfer files through NFS with a GUI AWS Snowball Edge

AWS Snowmobile

AWS Snowmobile is an exabyte-scale data transfer service used to move extremely large amounts of data from on premises to AWS. You can transfer up to 100 PB per AWS Snowmobile, a 45-foot long ruggedized shipping container pulled by a semi-trailer truck.

AWS Snowmobile makes it easy to move massive volumes of data to the cloud, including video libraries, image repositories, or even a complete data center migration. In 2017, one AWS customer moved 8,700 tapes with 54 million files to Amazon S3 using AWS Snowmobile. Figure 3.25 shows an AWS Snowmobile shipping container being pulled by a semi-trailer truck.

The figure shows an AWS Snowmobile shipping container being pulled by a semi-trailer truck.

Figure 3.25 AWS Snowmobile

How do you choose between AWS Snowmobile and AWS Snowball? To migrate large datasets of 10 PB or more in a single location, you should use AWS Snowmobile. For datasets that are less than 10 PB or distributed in multiple locations, you should use AWS Snowball.

Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose lets you prepare and load real-time data streams into data stores and analytics tools. Although it has much broader uses for loading data continuously for data streaming and analytics, it can be used as a one-time tool for data migration into the cloud.

Amazon Kinesis Data Firehose can capture, transform, and load streaming data into Amazon S3 and Amazon Redshift, which will be discussed further in Chapter 4, “Hello, Databases.” With Amazon Kinesis Data Firehose, you can avoid writing applications or managing resources. When you configure your data producers to send data to Amazon Kinesis Data Firehose, as shown in Figure 3.26, it automatically delivers the data to the destination that you specified. This is an efficient option to transform and deliver data from on premises to the cloud.

The figure shows an example of Amazon Kinesis Data Firehose.

Figure 3.26 Amazon Kinesis Data Firehose

Destinations include the following:

  • Amazon S3
  • Amazon Redshift
  • Amazon Elasticsearch Service
  • Splunk

Key Concepts

As you get started with Amazon Kinesis Data Firehose, you will benefit from understanding the concepts described next.

Kinesis Data Delivery Stream

You use Amazon Kinesis Data Firehose by creating an Amazon Kinesis data delivery stream and then sending data to it.

Record

A record is the data that your producer sends to a Kinesis data delivery stream, with a maximum size of 1,000 KB.

Data Producer

Data producers send records to Amazon Kinesis data delivery streams. For example, your web server could be configured as a data producer that sends log data to an Amazon Kinesis delivery stream.
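
To make the producer concept concrete, the following is a minimal sketch that sends a single record to a delivery stream using the AWS SDK for Java. The delivery stream name and the log line are placeholders, and the client assumes your default credentials and region configuration.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;

    public class FirehoseProducer {

        public static void main(String[] args) {
            // Placeholder delivery stream name; replace with your own.
            String deliveryStreamName = "my-delivery-stream";

            AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

            // Send a single log line as a record. Kinesis Data Firehose buffers
            // records and delivers them to the configured destination, such as Amazon S3.
            String logLine = "2019-01-01T00:00:00Z GET /index.html 200\n";
            Record record = new Record()
                    .withData(ByteBuffer.wrap(logLine.getBytes(StandardCharsets.UTF_8)));

            PutRecordRequest request = new PutRecordRequest()
                    .withDeliveryStreamName(deliveryStreamName)
                    .withRecord(record);

            System.out.println("Record ID: " + firehose.putRecord(request).getRecordId());
        }
    }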

Buffer Size and Buffer Interval

Amazon Kinesis Data Firehose buffers incoming data to a certain size or for a certain period of time before delivering it to destinations. Buffer size is in megabytes, and buffer interval is in seconds.

Data Flow

You can stream data to your Amazon S3 bucket, as shown in Figure 3.27. If data transformation is enabled, you can optionally back up source data to another Amazon S3 bucket.

The figure shows how to stream data to your Amazon S3 bucket.

Figure 3.27 Streaming to Amazon S3

AWS Direct Connect

Using AWS Direct Connect, you can establish private connectivity between AWS and your data center, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections.

These benefits can then be applied to storage migration. Transferring large datasets over the internet can be time-consuming and expensive, because your business-critical network traffic is contending for bandwidth with your other internet usage. One way to decrease the transfer time is to increase the bandwidth from your internet service provider, but that frequently requires a costly contract renewal and minimum commitment. AWS Direct Connect gives you a dedicated connection instead.

More details on using AWS Direct Connect will be provided in Chapter 4.

VPN Connection

You can connect your Amazon VPC to remote networks by using a VPN connection. Table 3.13 shows some of the connectivity options available to you.

Table 3.13 Amazon VPC Connectivity Options

VPN Connectivity Option Description
AWS managed VPN Create an IP Security (IPsec) VPN connection between your VPC and your remote network. On the AWS side of the VPN connection, a virtual private gateway provides two VPN endpoints (tunnels) for automatic failover. You configure your customer gateway on the remote side of the VPN connection.
AWS VPN CloudHub If you have more than one remote network (for example, multiple branch offices), create multiple AWS managed VPN connections via your virtual private gateway to enable communication between these networks.
Third-party software VPN appliance Create a VPN connection to your remote network by using an Amazon EC2 instance in your VPC that’s running a third-party software VPN appliance. AWS does not provide or maintain third-party software VPN appliances; however, you can choose from a range of products provided by partners and open-source communities.

You can also use AWS Direct Connect to create a dedicated private connection from a remote network to your VPC. You can combine this connection with an AWS managed VPN connection to create an IPsec-encrypted connection.

You will learn more about VPN connections in subsequent chapters.

Summary

AWS cloud computing provides a reliable, scalable, and secure place for your data. Cloud storage is a critical component of cloud computing, holding the information used by applications. Big Data analytics, data warehouses, Internet of Things, databases, and backup and archive applications all rely on some form of data storage architecture. Cloud storage is typically more reliable, scalable, and secure than traditional on-premises storage systems.

AWS offers a complete range of cloud storage services to support both application and archival compliance requirements. You may choose from object, file, and block storage services and cloud data migration options to start designing the foundation of your cloud IT environment.

Amazon EBS provides highly available, consistent, low-latency, persistent local block storage for Amazon EC2. It helps you to tune applications with the right storage capacity, performance, and cost.

Amazon EFS provides a simple, scalable file system interface and file system access semantics to make data available to one or more Amazon EC2 instances as shared file storage. Amazon EFS grows and shrinks capacity automatically, and it provides high throughput with consistent low latencies. Amazon EFS is designed for high availability and durability, and it provides performance for a broad spectrum of workloads and applications.

Amazon S3 is a form of object storage that provides a scalable, durable platform to make data accessible from any internet location, and it allows you to store and access any type of data over the internet. Amazon S3 is secure, 99.999999999 percent durable, and scales past tens of trillions of objects.

Amazon S3 Glacier provides extremely low-cost and highly durable object storage for long-term backup and archiving of any type of data. Amazon S3 Glacier is a solution for customers who want low-cost storage for infrequently accessed data. It can replace tape, and assist with compliance in highly regulated organizations.

Amazon offers a full portfolio of cloud data migration services to help simplify and accelerate moving data of all types and sizes into and out of the AWS Cloud. These include AWS Storage Gateway, AWS Import/Export Disk, AWS Snowball, AWS Snowball Edge, AWS Snowmobile, Amazon Kinesis Data Firehose, AWS Direct Connect, and a VPN connection.

Understanding when to use the right tool for your data storage and data migration options is a key component of the exam, including data dimension, block versus object versus file storage, data structure, and storage temperature. Be ready to compare and contrast the durability, availability, latency, means of access, and cost of different storage options for a given use case.

Exam Essentials

Know the different data dimensions. Consider the different data dimensions when choosing which storage option and storage class will be most appropriate for your data. This includes velocity, variety, volume, storage temperature (hot, warm, cold, frozen), data value, transient, reproducible, authoritative, and critical/regulated data.

Know the difference between block, object, and file storage. Block storage is commonly dedicated, low-latency storage for each host and is provisioned with each instance. Object storage is developed for the cloud, has vast scalability, is accessed over the Web, and is not directly attached to an instance. File storage enables accessing shared files as a file system.

Know the AWS shared responsibility model and how it applies to storage. AWS is responsible for securing the storage services. You are responsible for securing access to the artifacts that you create or objects that you store.

Know what Amazon EBS is and for what it is commonly used. Amazon EBS provides persistent block storage volumes for use with Amazon EC2 instances. It is designed for application workloads that benefit from fine tuning for performance, cost, and capacity. Typical use cases include Big Data analytics engines, relational and NoSQL databases, stream and log processing applications, and data warehousing applications. Amazon EBS volumes also serve as root volumes for Amazon EC2 instances.

Know what Amazon EC2 instance store is and what it is commonly used for. An instance store provides temporary block-level storage for your instance. It is located on disks that are physically attached to the host computer. Instance store is ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers. In some cases, you can use instance-store backed volumes. Do not use instance store (also known as ephemeral storage) for either production data or data that must be kept durable.

Know what Amazon S3 is and what it is commonly used for. Amazon S3 is object storage built to store and retrieve any amount of data from anywhere. It is secure, durable, and highly scalable cloud storage using a simple web services interface. Amazon S3 is commonly used for backup and archiving, content storage and distribution, Big Data analytics, static website hosting, cloud-native application hosting, and disaster recovery, and as a data lake.

Know the basic concepts of Amazon S3. Amazon S3 stores data as objects within resources called buckets. You can store as many objects as desired within a bucket, and write, read, and delete objects in your bucket. Objects contain data and metadata and are identified by a user-defined key in a flat file structure. Interfaces to Amazon S3 include a native REST interface, SDKs for many languages, the AWS CLI, and the AWS Management Console.

Know how to create a bucket, how to upload, download, and delete objects, how to make objects public, and how to open an object URL.

Understand how security works in Amazon S3. By default, new buckets are private and nothing is publicly accessible. When you add an object to a bucket, it is private by default.

Know how much data you can store in Amazon S3. The total volume of data and number of objects that you can store in Amazon S3 are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, consider using the Multipart Upload capability.
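
If you want to see what that looks like in practice, the following is a minimal sketch using the SDK's TransferManager, which automatically switches to multipart uploads for large files. The bucket name, key, and file path are placeholders.

    import java.io.File;

    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;

    public class LargeObjectUpload {

        public static void main(String[] args) throws InterruptedException {
            // Placeholder bucket, key, and file path; replace with your own.
            String bucketName = "my-bucket";
            String keyName = "backups/large-file.bin";
            File file = new File("/path/to/large-file.bin");

            // TransferManager splits large files into parts and uploads the parts
            // in parallel using the multipart upload API.
            TransferManager tm = TransferManagerBuilder.standard().build();
            try {
                Upload upload = tm.upload(bucketName, keyName, file);
                upload.waitForCompletion();
                System.out.println("Upload complete");
            } finally {
                tm.shutdownNow();
            }
        }
    }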

Know the Amazon S3 service limit for buckets per account. One hundred buckets are allowed per account.

Understand the durability and availability of Amazon S3. Amazon S3 standard storage is designed for 11 nines of durability and four nines of availability of objects over a given year. Other storage classes differ. Reduced Redundancy Storage (RRS) storage class is less durable than Standard, and it is intended for noncritical, reproducible data.

Know the data consistency model of Amazon S3. Amazon S3 provides read-after-write consistency for PUTs of new objects and eventual consistency for overwrite PUTs and DELETEs.

Know the Amazon S3 storage classes and use cases for each. Standard is used to store general-purpose data that needs high durability, high performance, and low latency access. Standard_IA is used for data that is less frequently accessed but that needs the same performance and availability when accessed. OneZone_IA is similar to Standard_IA, but it is stored only in a single Availability Zone, costing 20 percent less. However, data stored with OneZone_IA will be permanently lost in the event of an Availability Zone destruction. Reduced_Redundancy offers lower durability at lower cost for easily-reproducible data. Amazon S3 Glacier is used to store rarely accessed archival data at an extremely low cost, when three- to five-hour retrieval time is acceptable under the standard retrieval option. There are other retrieval options for higher and lower cost at shorter and longer retrieval times, including expedited retrieval (on-demand or provisioned, 1–5 minutes) and bulk retrieval (5–12 hours).

Every object within a bucket can be designated to a different storage class.
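
For example, you can assign a storage class per object at upload time. The following is a minimal sketch; the bucket name, key, and file path are placeholders.

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.PutObjectRequest;
    import com.amazonaws.services.s3.model.StorageClass;

    public class UploadWithStorageClass {

        public static void main(String[] args) {
            AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

            // Placeholder bucket, key, and file path; replace with your own.
            // The object is written directly to the Standard_IA storage class.
            s3Client.putObject(new PutObjectRequest("my-bucket", "reports/archive-2019.csv",
                    new File("/path/to/archive-2019.csv"))
                    .withStorageClass(StorageClass.StandardInfrequentAccess));
        }
    }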

Know how to enable static web hosting on Amazon S3. The steps to enable static web hosting on Amazon S3 require you to do the following (see the sketch after this list):

  • Create a bucket with the website hostname.
  • Upload your static content and make it public.
  • Enable static website hosting on the bucket.
  • Indicate the index and error page objects.
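
The following is a minimal sketch of the last two steps using the AWS SDK for Java; the bucket name and document names are placeholders, and you would still need to upload your content and make it public.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.BucketWebsiteConfiguration;

    public class EnableStaticWebsite {

        public static void main(String[] args) {
            // Placeholder bucket named after the website hostname; replace with your own.
            String bucketName = "www.example.com";

            AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

            // Enable static website hosting and set the index and error documents.
            s3Client.setBucketWebsiteConfiguration(bucketName,
                    new BucketWebsiteConfiguration("index.html", "error.html"));
        }
    }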

Know how to encrypt your data on Amazon S3. For server-side encryption, choose from SSE-S3 (Amazon S3 managed keys), SSE-C (customer-provided keys), and SSE-KMS (AWS KMS managed keys). For client-side encryption, choose from a client-side master key or an AWS KMS managed customer master key.

Know how to protect your data on Amazon S3. Know the different options for protecting your data in flight and in transit. Encrypt data in flight using HTTPS and at rest using server-side or client-side encryption. Enable versioning to keep multiple versions of an object in a bucket. Enable MFA Delete to protect against accidental deletion. Use ACLs, Amazon S3 bucket policies, and IAM policies for access control. Use presigned URLs for time-limited download access. Use cross-region replication to replicate data to another region automatically.
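
As an example of one of these controls, the following minimal sketch generates a presigned URL that allows a GET on a single object for 15 minutes; the bucket and key are placeholders.

    import java.net.URL;
    import java.util.Date;

    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

    public class PresignedDownloadLink {

        public static void main(String[] args) {
            // Placeholder bucket and key; replace with your own.
            String bucketName = "my-bucket";
            String objectKey = "media/video.mp4";

            AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

            // The generated URL grants GET access to this one object for 15 minutes.
            Date expiration = new Date(System.currentTimeMillis() + 15 * 60 * 1000L);
            URL url = s3Client.generatePresignedUrl(
                    new GeneratePresignedUrlRequest(bucketName, objectKey)
                            .withMethod(HttpMethod.GET)
                            .withExpiration(expiration));

            System.out.println(url);
        }
    }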

Know how to use lifecycle configuration rules. Lifecycle rules can be used to manage your objects so that they are stored cost-effectively throughout their lifecycle. There are two types of actions. Transition actions define when an object transitions to another storage class. Expiration actions define when objects expire and will be deleted on your behalf.
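
The following is a minimal sketch that combines both action types in one rule using the AWS SDK for Java; the bucket name, prefix, and day counts are placeholders.

    import java.util.Arrays;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
    import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Transition;
    import com.amazonaws.services.s3.model.StorageClass;
    import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
    import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

    public class ConfigureLifecycleRules {

        public static void main(String[] args) {
            // Placeholder bucket name and prefix; replace with your own.
            String bucketName = "my-bucket";

            // Transition objects under logs/ to Standard_IA after 30 days (a transition
            // action), then delete them after 365 days (an expiration action).
            BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                    .withId("archive-then-expire-logs")
                    .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate("logs/")))
                    .withTransitions(Arrays.asList(new Transition()
                            .withDays(30)
                            .withStorageClass(StorageClass.StandardInfrequentAccess)))
                    .withExpirationInDays(365)
                    .withStatus(BucketLifecycleConfiguration.ENABLED);

            AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();
            s3Client.setBucketLifecycleConfiguration(bucketName,
                    new BucketLifecycleConfiguration().withRules(Arrays.asList(rule)));
        }
    }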

Know what Amazon EFS is and what it is commonly used for. Amazon EFS provides simple, scalable, elastic file storage for use with AWS services and on-premises resources. Amazon EFS is easy to use and offers a simple interface that allows you to create and configure file systems quickly and easily. Amazon EFS is built to scale elastically on demand without disrupting applications, growing and shrinking automatically as you add and remove files, so your applications have the storage that they need, when they need it. Amazon EFS is designed for high availability and durability. Amazon EFS can be mounted to multiple Amazon EC2 instances at the same time.

Know the basics of Amazon S3 Glacier as a stand-alone service. Data is stored in encrypted archives that can be as large as 40 TB. Archives typically contain TAR and ZIP files. Vaults are containers for archives, and vaults can be locked for compliance.

Know which storage option to choose based on storage temperature. For hot to warm storage, use Amazon EC2 instance store, Amazon EBS, or Amazon S3. For cold storage, choose Amazon S3 Glacier.

Know which storage option to choose based on latency. Amazon EC2 instance store and Amazon EBS are designed for millisecond latency. Amazon S3 depends on size, anywhere from milliseconds to seconds to minutes. Amazon S3 Glacier is minutes to hours depending on retrieval option.

Know which storage option to choose based on data volume. Amazon EC2 instance store can be from 4 GB to 48 TB. Amazon EBS can be from 1 GiB to 16 TiB. Amazon S3 and Amazon S3 Glacier have no limit.

Know which storage option to choose based on item size. Amazon EC2 instance store and Amazon EBS depend on the size of the block storage and operating system limits. Amazon S3 has a 5 TB max size per object, but objects may be split. Amazon S3 Glacier has a 40 TB maximum.

Know when you should use Amazon EBS, Amazon EFS, Amazon S3, Amazon S3 Glacier, or AWS Storage Gateway for your data. For persistent local storage for Amazon EC2, use Amazon EBS. For a file system interface and file system access semantics to make data available to one or more Amazon EC2 instances, use Amazon EFS. For a scalable, durable platform to make data accessible from any internet location, use Amazon S3. For highly affordable, long-term cold storage, use Amazon S3 Glacier. For a hybrid storage cloud augmenting your on-premises environment with Amazon cloud storage, use AWS Storage Gateway.

Know when to choose Amazon EBS or Amazon EC2 instance store. Amazon EBS is most often the default option. However, Amazon EC2 instance store may be an option if your data does not meet any of the following criteria:

  • Must persist through instance stops, terminations, or hardware failures
  • Needs to be encrypted at the full volume level
  • Needs to be backed up with EBS snapshots
  • Needs to be removed from one instance and reattached to another

Know the different cloud data migration options. There are a number of options for migrating your data to the AWS Cloud, or having a hybrid data solution between AWS and your data center or on premises. These include (but are not limited to) AWS Storage Gateway, AWS Import/Export, AWS Snowball, AWS Snowball Edge, AWS Snowmobile, Amazon Kinesis Data Firehose, AWS Direct Connect, and AWS VPN connections. Know when to choose one over the other based on time, cost, or volume.

Know what AWS Storage Gateway is and how it is used for cloud data migration. AWS Storage Gateway is a hybrid cloud storage service that enables your on-premises applications to use AWS cloud storage seamlessly. Use this for data migration by means of a gateway that connects to AWS storage services, such as Amazon S3, Amazon S3 Glacier, and Amazon EBS.

Know what AWS Import/Export Disk is and how it is used for cloud data migration. AWS Import/Export Disk accelerates moving large amounts of data into and out of the AWS Cloud using portable storage devices for transport. It transfers your data directly onto and off of storage devices using Amazon’s high-speed internal network and bypassing the internet. For significant data sets, it is often much faster than transferring the data via the internet. You provide the hardware.

Know what AWS Snowball is and how it is used for cloud data migration. Snowball is a petabyte-scale data transport solution that uses devices designed to be secure to transfer large amounts of data into and out of the AWS Cloud. Using Snowball addresses common challenges with large-scale data transfers including high network costs, long transfer times, and security concerns. You can transfer data at as little as one-fifth the cost of transferring data via high-speed internet. AWS provides the hardware.

Know what AWS Snowball Edge is and how it is used for cloud data migration. AWS Snowball Edge is a 100-TB data transfer device with on-board storage and compute capabilities. Use it to move large amounts of data into and out of AWS, as a temporary storage tier for large local datasets, or to support local workloads in remote or offline locations. AWS Snowball Edge is a fast and inexpensive way to transfer large amounts of data when migrating to AWS.

Know what AWS Snowmobile is and how it is used for cloud data migration. AWS Snowmobile is an exabyte-scale data transfer service used to move extremely large amounts of data to AWS. You can transfer up to 100 PB per Snowmobile, a 45-foot long ruggedized shipping container pulled by a semi-trailer truck. Snowmobile makes it easy to move massive volumes of data to the cloud, even a complete data center migration.

Know what Amazon Kinesis Data Firehose is and how it is used for cloud data migration. Amazon Kinesis Data Firehose is the easiest way to load streaming data reliably into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. Kinesis Data Firehose can be used to transform and migrate data from on premises into the cloud.

Know what AWS Direct Connect is and how it is used for cloud data migration. Use AWS Direct Connect to establish private connectivity between AWS and your data center, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections.

Know what a VPN connection is and how it is used for cloud data migration. Connect your Amazon VPC to remote networks by using a VPN connection to increase privacy while migrating your data.

Know which tool to use for migrating storage to the AWS Cloud based on data size, timeline, and cost. There are two ways to migrate data: online and offline.

Online Use AWS Direct Connect to connect your data center privately and directly to an AWS Region. Use AWS Storage Gateway to integrate existing on-premises resources with the cloud. Use Amazon S3 Transfer Acceleration to work with Amazon S3 over long geographic distances. Use Amazon Kinesis Data Firehose to collect and ingest multiple streaming data sources or perform ETL on data while migrating to the AWS Cloud.

Offline Use AWS Snowball to transport petabytes of data physically in batches to the cloud. Use AWS Snowball Edge to transport petabytes of data physically in an appliance with on-board storage and compute capabilities, or to build hybrid storage that preserves existing on-premises investment and adds AWS services. Use AWS Snowmobile to migrate exabytes of data in batches to the cloud.

Exercises

For assistance in completing the following exercises, refer to the Amazon Simple Storage Service Developer Guide:

https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html

We assume that you have performed the Exercises in Chapter 1 and Chapter 2 to set up your development environment in AWS Cloud9, or have done so on your own system with the AWS SDK.

For instructions on creating and testing a working sample, see Testing the Amazon S3 Java Code Examples here:

https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingTheMPJavaAPI.html#TestingJavaSamples

Exercise 3.1

Create an Amazon Simple Storage Service (Amazon S3) Bucket

In this exercise, you will create an Amazon S3 bucket using the AWS SDK for Java. You will use this bucket in the exercises that follow.

For assistance in completing this exercise, copying this code, or for code in other languages, see the following documentation:

https://docs.aws.amazon.com/AmazonS3/latest/dev/create-bucket-get-location-example.html

  1. Enter the following code in your preferred development environment for Java:
    import java.io.IOException;
    
    import com.amazonaws.AmazonServiceException;
    import com.amazonaws.SdkClientException;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CreateBucketRequest;
    import com.amazonaws.services.s3.model.GetBucketLocationRequest;
    
    public class CreateBucket {
    
        public static void main(String[] args) throws IOException {
            String clientRegion = "*** Client region ***";
            String bucketName = "*** Bucket name ***";
    
            try {
                AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                        .withCredentials(new ProfileCredentialsProvider())
                        .withRegion(clientRegion)
                        .build();
    
                if (!s3Client.doesBucketExistV2(bucketName)) {
                    // Because the CreateBucketRequest object doesn't specify a region, the
                    // bucket is created in the region specified in the client.
                    s3Client.createBucket(new CreateBucketRequest(bucketName));

                    // Verify that the bucket was created by retrieving it and checking its location.
                    String bucketLocation = s3Client.getBucketLocation(new GetBucketLocationRequest(bucketName));
                    System.out.println("Bucket location: " + bucketLocation);
                }
            }
            catch(AmazonServiceException e) {
                // The call was transmitted successfully, but Amazon S3 couldn't process
                // it and returned an error response.
                e.printStackTrace();
            }
            catch(SdkClientException e) {
                // Amazon S3 couldn't be contacted for a response, or the client
                // couldn't parse the response from Amazon S3.
                e.printStackTrace();
            }
        }
    }
  2. Replace the static variable values for clientRegion and bucketName. Note that bucket names must be unique across all of AWS. Make a note of these two values; you will use the same region and bucket name for the exercises that follow in this chapter.
  3. Execute the code. Your bucket gets created with the name you specified in the region you specified. A successful result shows the following output:
    Bucket location: [bucketLocation]

Exercise 3.2

Upload an Object to a Bucket

Now that you have a bucket, you can add objects to it. In this example, you will create two objects. The first object has a text string as data, and the second object is a file. This example creates the first object by specifying the bucket name, object key, and text data directly in a call to AmazonS3Client.putObject(). The example creates a second object by using a PutObjectRequest that specifies the bucket name, object key, and file path. The PutObjectRequest also specifies the ContentType header and title metadata.

For assistance in completing this exercise, copying this code, or for code in other languages, see the following documentation:

https://docs.aws.amazon.com/AmazonS3/latest/dev/UploadObjSingleOpJava.html

  1. Enter the following code in your preferred development environment for Java:
    import java.io.File;
    import java.io.IOException;
    
    import com.amazonaws.AmazonServiceException;
    import com.amazonaws.SdkClientException;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;
    
    public class UploadObject {
    
        public static void main(String[] args) throws IOException {
            String clientRegion = "*** Client region ***";
            String bucketName = "*** Bucket name ***";
            String stringObjKeyName = "*** String object key name ***";
            String fileObjKeyName = "*** File object key name ***";
            String fileName = "*** Path to file to upload ***";
    
            try {
                AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                        .withRegion(clientRegion)
                        .withCredentials(new ProfileCredentialsProvider())
                        .build();
    
                // Upload a text string as a new object.
                s3Client.putObject(bucketName, stringObjKeyName, "Uploaded String Object");

                // Upload a file as a new object with ContentType and title specified.
                PutObjectRequest request = new PutObjectRequest(bucketName, fileObjKeyName, new File(fileName));
                ObjectMetadata metadata = new ObjectMetadata();
                metadata.setContentType("text/plain");
                metadata.addUserMetadata("x-amz-meta-title", "someTitle");
                request.setMetadata(metadata);
                s3Client.putObject(request);
            }
            catch(AmazonServiceException e) {
                // The call was transmitted successfully, but Amazon S3 couldn't process
                // it, so it returned an error response.
                e.printStackTrace();
            }
            catch(SdkClientException e) {
                // Amazon S3 couldn't be contacted for a response, or the client
                // couldn't parse the response from Amazon S3.
                e.printStackTrace();
            }
        }
    }
  2. Replace the static variable values for clientRegion and bucketName with the values that you used in the previous exercise.
  3. Replace the value for stringObjKeyName with the name of the key that you intend to create in your Amazon S3 bucket, which will upload a text string as a new object.
  4. Replace the Uploaded String Object text with the text being placed inside the object that you are generating.
  5. Replace the someTitle text in the code with your own metadata title for the object that you are uploading.
  6. Create a local file on your machine and then replace the value for fileName with the full path and filename of the file that you created.
  7. Replace the fileObjKeyName with the key name that you want for the file that you will be uploading. A file can be uploaded with a different name than the filename that’s used locally.
  8. Execute the code. A successful run without errors will create two objects in the bucket that you created in Exercise 3.1.

Exercise 3.3

Emptying and Deleting a Bucket

Now that you have finished with the Amazon S3 exercises, you will want to clean up your environment by deleting all the files and the bucket you created. It is easy to delete an empty bucket. However, in some situations, you may need to delete or empty a bucket that contains objects. In this exercise, we show you how to delete objects and then delete the bucket.

For assistance in completing this exercise, copying this code, or for code in other languages, see the following documentation:

https://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html

https://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-sdk-java

  1. Enter the following code in your preferred development environment for Java:
    import java.util.Iterator;
    
    import com.amazonaws.AmazonServiceException;
    import com.amazonaws.SdkClientException;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ListVersionsRequest;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;
    import com.amazonaws.services.s3.model.S3VersionSummary;
    import com.amazonaws.services.s3.model.VersionListing;
    
    public class DeleteBucket {
    
        public static void main(String[] args) {
            String clientRegion = "*** Client region ***";
            String bucketName = "*** Bucket name ***";
    
            try {
                AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                        .withCredentials(new ProfileCredentialsProvider())
                        .withRegion(clientRegion)
                        .build();
    
                // Delete all objects from the bucket. This is sufficient for
                // unversioned buckets. For versioned buckets, when you attempt to
                // delete objects, Amazon S3 inserts delete markers for all objects
                // but doesn't delete the object versions. To delete objects from
                // versioned buckets, delete all of the object versions before
                // deleting the bucket (see below for an example).
                ObjectListing objectListing = s3Client.listObjects(bucketName);
                while (true) {
                    Iterator<S3ObjectSummary> objIter = objectListing.getObjectSummaries().iterator();
                    while (objIter.hasNext()) {
                        s3Client.deleteObject(bucketName, objIter.next().getKey());
                    }
    
                    // If the bucket contains many objects, the listObjects() call
                    // might not return all of the objects in the first listing. Check
                    // to see whether the listing was truncated. If so, retrieve the
                    // next page of objects and delete them.
                    if (objectListing.isTruncated()) {
                        objectListing = s3Client.listNextBatchOfObjects(objectListing);
                    } else {
                        break;
                    }
                }
    
                // Delete all object versions (required for versioned buckets).
                VersionListing versionList = s3Client.listVersions(new ListVersionsRequest().withBucketName(bucketName));
                while (true) {
                    Iterator<S3VersionSummary> versionIter = versionList.getVersionSummaries().iterator();
                    while (versionIter.hasNext()) {
                        S3VersionSummary vs = versionIter.next();
                        s3Client.deleteVersion(bucketName, vs.getKey(), vs.getVersionId());
                    }
    
                    if (versionList.isTruncated()) {
                        versionList = s3Client.listNextBatchOfVersions(versionList);
                    } else {
                        break;
                    }
                }
    
                // After all objects and object versions are deleted, delete the bucket.
                s3Client.deleteBucket(bucketName);
            }
            catch(AmazonServiceException e) {
                // The call was transmitted successfully, but Amazon S3 couldn't process
                // it, so it returned an error response.
                e.printStackTrace();
            }
            catch(SdkClientException e) {
                // Amazon S3 couldn't be contacted for a response, or the client couldn't
                // parse the response from Amazon S3.
                e.printStackTrace();
            }
        }
    }
  2. Replace the static variable values for clientRegion and bucketName with the values that you used in the previous steps.
  3. Execute the code.
  4. When execution is complete without errors, both of your objects and your bucket will have been deleted.

Review Questions

  1. You are developing an application that will run across dozens of instances. It uses some components from a legacy application that requires some configuration files to be copied from a central location and be held on a volume local to each of the instances. You plan to modify your application with a new component in the future that will hold this configuration in Amazon DynamoDB. However, in the interim, which storage option should you use that will provide the lowest cost and the lowest latency for your application to access the configuration files?

    1. Amazon S3
    2. Amazon EBS
    3. Amazon EFS
    4. Amazon EC2 instance store
  2. In what ways does Amazon Simple Storage Service (Amazon S3) object storage differ from block and file storage? (Select TWO.)

    1. Amazon S3 stores data in fixed size blocks.
    2. Objects are identified by a numbered address.
    3. Objects can be any size.
    4. Objects contain both data and metadata.
    5. Objects are stored in buckets.
  3. You are restoring an Amazon Elastic Block Store (Amazon EBS) volume from a snapshot. How long will it take before the data is available?

    1. It depends on the provisioned size of the volume.
    2. The data will be available immediately.
    3. It depends on the amount of data stored on the volume.
    4. It depends on whether the attached instance is an Amazon EBS–optimized instance.
  4. What are some of the key characteristics of Amazon Simple Storage Service (Amazon S3)? (Select THREE.)

    1. All objects have a URL.
    2. Amazon S3 can store unlimited amounts of data.
    3. Buckets can be mounted to the file system of multiple Amazon EC2 instances.
    4. Amazon S3 uses a Representational State Transfer (REST) application program interface (API).
    5. You must pre-allocate the storage in a bucket.
  5. Amazon S3 Glacier is well-suited to data that is which of the following? (Select TWO.)

    1. Infrequently or rarely accessed
    2. Must be immediately available when needed
    3. Is available after a three- to five-hour restore period
    4. Is frequently erased within 30 days
  6. You have valuable media files hosted on AWS and want them to be served only to authenticated users of your web application. You are concerned that your content could be stolen and distributed for free. How can you protect your content?

    1. Use static web hosting.
    2. Generate presigned URLs for content in the web application.
    3. Use AWS Identity and Access Management (IAM) policies to restrict access.
    4. Use logging to track your content.
  7. Which of the following are features of Amazon Elastic Block Store (Amazon EBS)? (Select TWO.)

    1. Data stored on Amazon EBS is automatically replicated within an Availability Zone.
    2. Amazon EBS data is automatically backed up to tape.
    3. Amazon EBS volumes can be encrypted transparently to workloads on the attached instance.
    4. Data on an Amazon EBS volume is lost when the attached instance is stopped.
  8. Which option should you choose for Amazon EFS when tens, hundreds, or thousands of Amazon EC2 instances will be accessing the file system concurrently?

    1. General-Purpose performance mode
    2. RAID 0
    3. Max I/O performance mode
    4. Change to a larger instance
  9. Which of the following must be performed to host a static website in an Amazon Simple Storage Service (Amazon S3) bucket? (Select THREE.)

    1. Configure the bucket for static hosting, and specify an index and error document.
    2. Create a bucket with the same name as the website.
    3. Enable File Transfer Protocol (FTP) on the bucket.
    4. Make the objects in the bucket world-readable.
    5. Enable HTTP on the bucket.
  10. You have a workload that requires 1 TB of durable block storage at 1,500 IOPS during normal use. Every night there is an extract, transform, load (ETL) task that requires 3,000 IOPS for 15 minutes. What is the most appropriate volume type for this workload?

    1. Use a Provisioned IOPS SSD volume at 3,000 IOPS.
    2. Use an instance store.
    3. Use a general-purpose SSD volume.
    4. Use a magnetic volume.
  11. Which statements about Amazon S3 Glacier are true? (Select THREE.)

    1. It stores data in objects that live in buckets.
    2. Archives are identified by user-specified key names.
    3. Archives take 3–5 hours to restore.
    4. Vaults can be locked.
    5. It can be used as a standalone service and as an Amazon S3 storage class.
  12. You are developing an application that will be running on several hundred Amazon EC2 instances. The application on each instance will be required to reach out through a file system protocol concurrently to a file system holding the files. Which storage option should you choose?

    1. Amazon EFS
    2. Amazon EBS
    3. Amazon EC2 instance store
    4. Amazon S3
  13. You need to take a snapshot of an Amazon Elastic Block Store (Amazon EBS) volume. How long will the volume be unavailable?

    1. It depends on the provisioned size of the volume.
    2. The volume will be available immediately.
    3. It depends on the amount of data stored on the volume.
    4. It depends on whether the attached instance is an Amazon EBS–optimized instance.
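Snapshots of the kind described in the preceding question are requested through a single asynchronous API call; the volume ID below is a placeholder.

    import boto3

    ec2 = boto3.client("ec2")

    # Request a point-in-time snapshot of an EBS volume. The call returns
    # immediately and the snapshot completes in the background.
    snapshot = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="Nightly backup of data volume",
    )
    print(snapshot["SnapshotId"], snapshot["State"])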
  14. Amazon Simple Storage Service (S3) bucket policies can restrict access to an Amazon S3 bucket and objects by which of the following? (Select THREE.)

    1. Company name
    2. IP address range
    3. AWS account
    4. Country of origin
    5. Objects with a specific prefix
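As background for the preceding question, a bucket policy is a JSON document attached to the bucket itself. The sketch below denies requests that do not originate from a hypothetical corporate IP range; the bucket name, CIDR block, and statement ID are placeholders.

    import json
    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRequestsOutsideCorpRange",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::example-bucket",
                    "arn:aws:s3:::example-bucket/*",
                ],
                "Condition": {"NotIpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
            }
        ],
    }

    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))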
  15. Which of the following are not appropriate use cases for Amazon Simple Storage Service (Amazon S3)? (Select TWO.)

    1. Storing static web content or hosting a static website
    2. Storing a file system mounted to an Amazon Elastic Compute Cloud (Amazon EC2) instance
    3. Storing backups for a relational database
    4. Primary storage for a database
    5. Storing logs for analytics
  16. Which features enable you to manage access to Amazon Simple Storage Service (Amazon S3) buckets or objects? (Select THREE.)

    1. Enable static website hosting on the bucket.
    2. Create a presigned URL for an object.
    3. Use an Amazon S3 Access Control List (ACL) on a bucket or object.
    4. Use a lifecycle policy.
    5. Use an Amazon S3 bucket policy.
  17. Your application stores critical data in Amazon Simple Storage Service (Amazon S3), which must be protected against inadvertent or intentional deletion. How can this data be protected? (Select TWO.)

    1. Use cross-region replication to copy data to another bucket automatically.
    2. Set a vault lock.
    3. Enable versioning on the bucket.
    4. Use a lifecycle policy to migrate data to Amazon S3 Glacier.
    5. Enable MFA Delete on the bucket.
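For reference on the mechanisms named in the preceding question, the boto3 sketch below enables versioning on a placeholder bucket. MFA Delete is a separate versioning setting that additionally requires the bucket owner's root credentials and MFA device to configure.

    import boto3

    s3 = boto3.client("s3")

    # Keep prior versions of overwritten or deleted objects.
    s3.put_bucket_versioning(
        Bucket="example-critical-data",
        VersioningConfiguration={"Status": "Enabled"},
    )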
  18. You have a set of users who have been granted access to your Amazon Simple Storage Service (Amazon S3) bucket. For compliance purposes, you need a record of who accessed the data in that bucket and from where. What should you do?

    1. Enable versioning on the bucket.
    2. Enable website hosting on the bucket.
    3. Enable server access logging on the bucket.
    4. Create an AWS Identity and Access Management (IAM) bucket policy.
    5. Enable Amazon CloudWatch logs.
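Server access logging, one of the options in the preceding question, is configured on the bucket and delivers log records to a second bucket. The sketch below uses placeholder bucket names and assumes the target bucket already grants the S3 log delivery service permission to write to it.

    import boto3

    s3 = boto3.client("s3")

    # Record requests made against the source bucket as log objects
    # written to the target bucket under the given prefix.
    s3.put_bucket_logging(
        Bucket="example-source-bucket",
        BucketLoggingStatus={
            "LoggingEnabled": {
                "TargetBucket": "example-log-bucket",
                "TargetPrefix": "access-logs/",
            }
        },
    )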
  19. What are some reasons to enable cross-region replication on an Amazon Simple Storage Service (Amazon S3) bucket? (Select THREE.)

    1. Your compliance requirements dictate that you store data at a greater distance than Availability Zones, which are tens of miles apart, can provide.
    2. You want to minimize latency when your customers are in two geographic regions.
    3. You need a backup of your data in case of accidental deletion.
    4. You have compute clusters in two different AWS Regions that analyze the same set of objects.
    5. Your data requires at least five nines of durability.
  20. Your company requires that all data sent to external storage be encrypted before being sent. You will be sending company data to Amazon Simple Storage Service (Amazon S3). Which Amazon S3 encryption solution will meet this requirement?

    1. Server-Side Encryption with AWS managed keys (SSE-S3)
    2. Server-Side Encryption with customer-provided keys (SSE-C)
    3. Client-side encryption with customer-managed keys
    4. Server-side encryption with AWS Key Management Service (AWS KMS) keys (SSE-KMS)
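To make the distinction in the preceding question concrete: with server-side encryption, Amazon S3 encrypts the object after it arrives, whereas client-side encryption encrypts the bytes in your application before they are ever sent. The sketch below shows a server-side (SSE-KMS) upload with placeholder bucket, key, and KMS key values.

    import boto3

    s3 = boto3.client("s3")

    # Upload an object and ask S3 to encrypt it at rest with a KMS key.
    s3.put_object(
        Bucket="example-bucket",
        Key="reports/2023-q1.csv",
        Body=b"...data...",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    )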
  21. How is data stored in Amazon Simple Storage Service (Amazon S3) for high durability?

    1. Data is automatically replicated to other regions.
    2. Data is automatically replicated within a region.
    3. Data is replicated only if versioning is enabled on the bucket.
    4. Data is automatically backed up on tape and restored if needed.