Reliability

The reliability pillar has a number of crossovers with the operational excellence pillar. We live in a 24/7 world, and users expect applications to be available at all hours, running at full capacity. The reliability pillar of the AWS Well-Architected Framework focuses on techniques and practices that can help you achieve a zero-downtime infrastructure.

When we speak of reliability, be it uptime for a web server or the durability of saved data, we usually speak in terms of nines. For example, Amazon S3 offers a staggering 11 nines of durability for stored objects. That's 99.999999999% durability, which is made possible by storing redundant copies across multiple Availability Zones (AZs) within a region. Several services offer SLAs that guarantee a certain amount of uptime, such as Amazon DynamoDB, which offers five nines (99.999%) of uptime for global tables.

How exactly does that break down? Let's take five nines and see what this means in terms of time per week. In a single week, we have the following:

7 days * 24 = 168 hours
168 hours * 60 = 10,080 minutes
10,080 minutes * 60 = 604,800 seconds

At 99.999% availability, the remaining 0.001% of 604,800 seconds leaves us with about 6 seconds of total downtime each week.

You have probably heard many vendors bragging that their service has five nines or better, but in reality, it is extremely difficult to achieve. It means there are only about 6 seconds throughout an entire week, or roughly 5 minutes per year, during which a request to your application may fail due to an outage. And this includes scheduled maintenance!
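The arithmetic above generalizes to any number of nines and any time period. The following is a minimal sketch; the function name `downtime_budget` and the constants are our own, chosen only for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60    # 604,800
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignores leap years)


def downtime_budget(nines: int, period_seconds: int) -> float:
    """Return the seconds of downtime allowed in a period at N nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return period_seconds * unavailability


weekly = downtime_budget(5, SECONDS_PER_WEEK)
yearly = downtime_budget(5, SECONDS_PER_YEAR)
print(f"Five nines: {weekly:.2f} s per week, {yearly / 60:.2f} min per year")
```

Running the same function with three or four nines shows how quickly the budget grows as each nine is dropped, which is useful when you are deciding what level of availability is worth paying for.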

To achieve levels of reliability anywhere close to five nines, you are going to need a lot of redundancy, because software and hardware fail all the time. It's just a fact of life. When you are running thousands of servers, something is almost always broken. The only way to hide that brokenness from your users is to double or triple up on everything so that when the inevitable single failure happens, there is always a backup standing by to take its place.

And this brings us to another concept, called single point of failure. Many architectures take care to replicate common resources such as web servers, but what about the router? The firewall? The database? A chain is only as strong as its weakest link. Be sure to inspect every component of your system to make sure you haven't forgotten something that doesn't have any redundancy.
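To see why one unreplicated component can undo the benefit of redundancy elsewhere, it helps to put numbers on it. The sketch below uses the standard availability formulas for components in series and in parallel. It assumes that failures are independent, which real systems only approximate, and the availability figures are made up for illustration.

```python
from math import prod


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies: down only if every copy is down."""
    return 1 - (1 - single) ** copies


def chain_availability(components: list[float]) -> float:
    """Availability of components in series: up only if every link is up."""
    return prod(components)


# Hypothetical figures: three web servers at 99% each, one router and
# one database at 99.9% each with no redundancy.
web_tier = redundant_availability(0.99, copies=3)
system = chain_availability([web_tier, 0.999, 0.999])

print(f"Web tier alone: {web_tier:.6f}")
print(f"Whole system:   {system:.6f}")
```

Under these assumptions, the tripled web tier reaches about six nines on its own, yet the system as a whole falls below three nines because of the single router and single database.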

Deploying changes to an environment without downtime can be an extremely difficult problem to solve. For most traditional, monolithic applications, users had to get used to scheduled downtime for maintenance, be that a monthly OS patch update or a weekly new software build. The application is taken offline, changes are applied, and everyone crosses their fingers and hopes for the best when it comes back online. Software updates are almost always to blame when a complex software system has unexpected downtime.

Here are a few strategies that can help you to achieve zero-downtime updates:

Blue-green deployments: Spin up a copy of your workload that is behind an alternate URL. Test it thoroughly, and when it is ready, swap it with the prior version so that the new copy (green) is now live and the old copy (blue) is no longer in use. If you see errors in the new deployment with live traffic, simply swap them again to roll back.
Canary deployments: Spin up a new copy of your workload and start by sending small amounts of live traffic to it. If there are no errors, ramp up traffic until the new copy has all the traffic.
Feature flags: Deploy hidden features that can be rolled out slowly and rolled back if needed via configuration at runtime.
Schema-less databases: Traditional relational databases often require downtime for significant schema changes, which are inevitable in any evolving application. Using a database with a flexible schema system, such as Amazon DynamoDB, can mitigate this issue.
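The canary strategy in the list above can be sketched as a simple control loop. This is an illustration rather than the API of any AWS service: the function `run_canary`, the traffic steps, and the 1% error threshold are all assumptions, and `observe_error_rate` stands in for whatever monitoring you would actually query.

```python
from typing import Callable, Sequence


def run_canary(
    observe_error_rate: Callable[[int], float],
    steps: Sequence[int] = (1, 5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Ramp traffic to the new version step by step.

    Returns the final percentage of traffic on the new version:
    the last step on success, or 0 if the rollout was rolled back.
    """
    current = 0
    for percent in steps:
        # Shift `percent` of live traffic, then check how it behaves.
        if observe_error_rate(percent) > max_error_rate:
            return 0  # errors seen: send all traffic back to the old copy
        current = percent
    return current


# A healthy release reaches 100%; one that fails under load is rolled back.
print(run_canary(lambda percent: 0.001))
print(run_canary(lambda percent: 0.2 if percent >= 25 else 0.001))
```

The same loop shape works for a percentage-based feature flag: instead of shifting traffic between two copies of the workload, each step raises the share of users who see the hidden feature.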

The key to any reliable system is testing. Test everything, before and after it is put into production. Automate as much of your testing as possible so that your tests can be run quickly and efficiently every time a change is applied.

In the cloud, it's possible to run your tests with production amounts of data and traffic since it's inexpensive to spin up resources for a short-lived test. One of the key principles of the reliability pillar is to stop guessing about your necessary capacity. Use data to inform your decisions about the resources you provision.

Use auto-scaling for any resource that offers it, such as EC2 and DynamoDB. Configure thresholds for expanding and contracting resources so that you are always using close to what you need for the current workload.
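The idea of expanding and contracting on thresholds can be shown with a small decision function. This is a sketch of the logic only, not how EC2 or DynamoDB auto-scaling is configured; the function name, the 70% and 30% thresholds, and the capacity bounds are hypothetical values.

```python
def desired_capacity(
    current: int,
    utilization: float,
    scale_out_above: float = 0.70,
    scale_in_below: float = 0.30,
    minimum: int = 2,
    maximum: int = 10,
) -> int:
    """Return the number of instances to run given average utilization."""
    if utilization > scale_out_above:
        target = current + 1  # workload is hot: add capacity
    elif utilization < scale_in_below:
        target = current - 1  # workload is idle: remove capacity
    else:
        target = current      # within the comfortable band: do nothing
    # Clamp to the configured bounds.
    return max(minimum, min(maximum, target))


print(desired_capacity(current=4, utilization=0.85))
print(desired_capacity(current=4, utilization=0.10))
```

Keeping the minimum at two or more means that scaling in never removes the redundancy discussed earlier, and the maximum caps the bill if utilization spikes unexpectedly.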

Finally, don't forget to analyze your dependencies. It doesn't matter how reliable your system is if it has a dependency on a separate system that is not reliable. If you can't control the reliability of the dependency, put a queue or a batch process in front of it so that your system does not have to call the dependency in real time.
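Here is a minimal, in-memory sketch of that decoupling. The class `DeferredCalls` is hypothetical, and in a real system a durable queue service would take the place of the in-memory `deque` so that pending work survives a restart.

```python
from collections import deque
from typing import Any, Callable


class DeferredCalls:
    """Accept work immediately and deliver it to a dependency later."""

    def __init__(self, call_dependency: Callable[[Any], None]) -> None:
        self._call = call_dependency
        self._pending: deque = deque()

    def submit(self, item: Any) -> None:
        """Queue the item; the caller never waits on the dependency."""
        self._pending.append(item)

    def drain(self) -> int:
        """Try each pending item once and return how many were delivered.

        Items that fail are kept for the next run, in their original order.
        """
        delivered = 0
        for _ in range(len(self._pending)):
            item = self._pending.popleft()
            try:
                self._call(item)
                delivered += 1
            except Exception:
                self._pending.append(item)  # retry in the next batch
        return delivered

    @property
    def backlog(self) -> int:
        """Number of items still waiting for delivery."""
        return len(self._pending)
```

A scheduled job would call `drain()` periodically. While the dependency is down, `submit()` keeps succeeding and the backlog simply grows, so your users see a delay rather than an outage.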

This pillar also crosses over with cost optimization, since a fully redundant system operating at five nines can be quite expensive! You might need to make some trade-offs to supply your users with a system that has acceptable levels of downtime, such as three nines (nearly 9 hours of downtime per year), in exchange for a more affordable service.

As we mentioned in several chapters in this book, service limits are often a surprising cause of downtime. They are also an excellent reason to maintain a Business Support contract, so that you can quickly get someone on the phone to raise your soft limits if you reach capacity. Soft limits are there to protect you from accidentally over-provisioning resources and waking up to an expensive bill. But if you forget about them, they will cause downtime. Have a strategy for studying limits, raising them where appropriate, and monitoring your resources so that you know how much headroom you have left at any given time.
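Monitoring your headroom can be as simple as comparing current usage to each limit and flagging anything that is getting close. The sketch below assumes you have already collected both sets of numbers; the function name, the limit names, the figures, and the 80% warning threshold are all hypothetical.

```python
def limits_near_ceiling(
    usage: dict[str, int],
    limits: dict[str, int],
    warn_at: float = 0.80,
) -> dict[str, float]:
    """Return the fraction used for every limit at or above `warn_at`.

    Assumes every limit is a positive number.
    """
    report = {}
    for name, limit in limits.items():
        fraction = usage.get(name, 0) / limit
        if fraction >= warn_at:
            report[name] = fraction
    return report


# Hypothetical snapshot of an account.
usage = {"running_instances": 18, "tables": 40}
limits = {"running_instances": 20, "tables": 256}

for name, fraction in limits_near_ceiling(usage, limits).items():
    print(f"{name}: {fraction:.0%} of limit used, consider requesting an increase")
```

Run on a schedule, a check like this gives you time to request an increase before the limit is actually reached.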

Refer to the official AWS white paper for more information: https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf.
