Operational excellence

The operational excellence pillar is often overlooked; even companies that have a good grasp of architectural principles tend to need improvement in this area. The key to operational excellence is to design your systems with operations in mind from day one. Just as with the security pillar, which we must not leave until after the application is built, we have to make sure that what we are constructing is operable.

When you add a new feature to your application, ask yourself these questions:

  • How will I know if users are using the new feature?
  • What are the Key Performance Indicators (KPIs) that tell me if the feature is working well?
  • Are there any alerts that I should send to the operations team if this feature is not working as intended?
  • Are there automated tests in place to make sure new code changes do not break this feature?
  • Is the feature documented thoroughly for both internal and external users?
  • Is the data that's generated by this feature available to a reporting team for detailed analysis?

A feature is not done just because its code is complete. All of those aspects must be considered before calling it done and shipping it to production.
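
As a rough illustration of the KPI and alerting questions above, the following sketch computes an error rate for a feature over one monitoring window and decides whether the operations team should be alerted. The names `FeatureStats`, `error_rate`, and `should_alert`, and the 5% threshold, are assumptions made for this example; they do not come from AWS or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class FeatureStats:
    """Counters a feature might emit over one monitoring window (hypothetical)."""
    requests: int  # how many times users invoked the feature
    errors: int    # how many of those invocations failed


def error_rate(stats: FeatureStats) -> float:
    """Return the fraction of failed requests, or 0.0 when there is no traffic."""
    if stats.requests == 0:
        return 0.0
    return stats.errors / stats.requests


def should_alert(stats: FeatureStats, max_error_rate: float = 0.05) -> bool:
    """Alert when the error rate exceeds the threshold (5% is an assumed default)."""
    return error_rate(stats) > max_error_rate
```

A window with no requests reports an error rate of 0.0 here. In practice, zero usage may deserve its own alert, since it answers the first question in the list: whether users are using the feature at all.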

Another critical aspect of operational excellence is to adopt Infrastructure as Code (IaC). In Chapter 1, AWS Fundamentals, you were introduced to AWS CloudFormation, which allows you to construct your resources using JSON or YAML files, instead of logging in to the console and creating them manually, or via the command-line interface (CLI). IaC allows you to rapidly spin up new environments for development, testing, and disaster recovery. Treating infrastructure as code also allows you to adopt best practices from the software development industry, such as revision control and code reviews.
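
To make the idea concrete, here is a minimal sketch that generates a CloudFormation template in JSON from Python, using only the standard library. The logical ID `ExampleBucket`, the `Environment` tag, and the function names are hypothetical choices for illustration; a real stack would define whatever resources your application needs.

```python
import json


def build_template(environment: str) -> dict:
    """Build a minimal CloudFormation template containing one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Example stack for the {environment} environment",
        "Resources": {
            # "ExampleBucket" is a hypothetical logical ID chosen for this sketch.
            "ExampleBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "Tags": [{"Key": "Environment", "Value": environment}]
                },
            }
        },
    }


def render_template(environment: str) -> str:
    """Serialize the template to JSON text that can be committed and reviewed."""
    return json.dumps(build_template(environment), indent=2, sort_keys=True)
```

Calling `render_template("dev")` and `render_template("dr")` yields two templates that differ only in the environment name, which is the property that makes it quick to spin up matching environments. Because the output is plain text, it can go through the same revision control and code review process as application code.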

When you experience an operational failure, don't sweep it under the rug and pretend it never happened. Write a detailed report about exactly what happened, why it happened, what the root cause was, and how you solved it. Document the measures you took to make sure it never happens again, and then share that report with everyone in your organization. We all fail at some point. All systems, both human and machine, are fallible. Expect failure, prepare for it, and learn from it.

One of the best ways to prepare for operational events is to conduct what is called a game day. Take a copy of your infrastructure, assign your operations team to monitor it, and then purposefully throw them curveballs to see how they react. What happens if you stop a critical EC2 instance? How will they react if you misconfigure a security group and your web servers are no longer accessible? Can they recover from a faulty software release? The only way to know is to test your team frequently, document the results, and make changes to improve your performance.
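
One possible way to document game day results is to record a timeline for each injected failure and compare scenarios afterward. The sketch below assumes a simple record with three timestamps; the class `GameDayResult` and the helper `slowest_recovery` are hypothetical names for this example, not part of any AWS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class GameDayResult:
    """Timeline of one injected failure during a game day (hypothetical record)."""
    scenario: str
    injected_at: datetime   # when the fault was introduced
    detected_at: datetime   # when the team noticed the problem
    recovered_at: datetime  # when service was restored

    def time_to_detect(self) -> timedelta:
        """Elapsed time between injecting the fault and the team noticing it."""
        return self.detected_at - self.injected_at

    def time_to_recover(self) -> timedelta:
        """Elapsed time between injecting the fault and restoring service."""
        return self.recovered_at - self.injected_at


def slowest_recovery(results: List[GameDayResult]) -> GameDayResult:
    """Return the scenario that took the longest to recover from."""
    return max(results, key=lambda r: r.time_to_recover())
```

Tracking these two durations across repeated game days gives the team a simple way to see whether changes to runbooks and alerts are actually improving performance.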

There will be times when you face a problem that you can't solve alone. AWS Support is a fantastic resource that you should take advantage of to help you solve problems with your infrastructure.

At a bare minimum, purchase a Business Support contract on all production accounts!

Business Support gives you full access to AWS Trusted Advisor, an invaluable resource that can identify cost savings significant enough to more than make up for the support fees. Even more critically, a Business Support contract gives you a service-level agreement (SLA) with very short response times, so that when you need help, you can get someone on the line quickly.

The most common support call you will make is to increase your service limits. AWS sets soft limits on many resources to protect customers from accidentally provisioning too many resources and waking up to a huge bill. When you hit those limits, your application can grind to a halt. Without Business Support, you might not get a limit raised quickly enough to meet the SLA you have with your own customers and get them back up and running in an acceptable time period.
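
Because hitting a limit can stop an application outright, it helps to notice when usage is approaching a limit rather than after it has been reached. The sketch below assumes you already know the current usage and the limit for a resource; the function names and the 80% warning level are illustrative assumptions, and the sketch does not call any AWS API.

```python
def quota_utilization(in_use: int, limit: int) -> float:
    """Return the fraction of a service limit currently consumed."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return in_use / limit


def needs_limit_increase(in_use: int, limit: int, warn_at: float = 0.8) -> bool:
    """Flag a resource once usage reaches the warning level (80% is assumed)."""
    return quota_utilization(in_use, limit) >= warn_at
```

For example, with a made-up limit of 20 instances, `needs_limit_increase(17, 20)` flags the resource while `needs_limit_increase(10, 20)` does not, leaving time to request an increase before the limit is hit.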

Operational excellence is the key to providing your customers with a smooth, consistent experience. It's an area where you are never really done – there are always more things to learn and improvements to make, so keep studying the best practices and keep looking for ways to enhance your operational readiness.

To learn more about the operational excellence pillar, see the official white paper: https://d1.awsstatic.com/whitepapers/architecture/AWS-Operational-Excellence-Pillar.pdf.