Team

Monitoring and alerting are pointless unless someone acts on them to remediate the situation. Once an alert has been triggered, a team needs to triage, debug, and resolve it. There are two objectives here:

  • Immediately bring the production environment to a stable state
  • Collect all the data needed for effective root cause analysis (if required, take an instance out of the production cluster)

The first objective takes top priority. Debugging a system while it is down in production prolongs the outage, and debugging under that pressure is inefficient.

To help during production outages, there need to be step-by-step instructions on how to debug various situations. This is typically called an on-call runbook. For each alert, the engineer can consult the runbook to identify known causes of the deviation from the norm, how to correct the situation, and how to debug or collect more information. Runbooks should exist both for the infrastructure and for each service.
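A runbook of this kind can be kept as structured data rather than free-form text, so that each alert maps to its known causes, its remediation steps, and the diagnostics to collect. The following is a minimal sketch under stated assumptions: the `RunbookEntry` type, the `lookup` helper, and the alert name and steps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunbookEntry:
    """One runbook page: what an alert usually means and what to do about it."""
    alert: str
    known_causes: List[str]
    remediation: List[str]  # steps that restore stability, in order
    diagnostics: List[str] = field(default_factory=list)  # data for root cause analysis


# Hypothetical entry; the alert name and steps are illustrative only.
RUNBOOK: Dict[str, RunbookEntry] = {
    "api_high_error_rate": RunbookEntry(
        alert="api_high_error_rate",
        known_causes=["bad deployment", "downstream dependency unavailable"],
        remediation=[
            "roll back the latest release",
            "confirm the error rate returns to baseline",
        ],
        diagnostics=["save request logs for the incident window"],
    ),
}


def lookup(alert: str) -> RunbookEntry:
    """Return the entry for an alert, or a generic escalation entry if none exists."""
    fallback = RunbookEntry(
        alert=alert,
        known_causes=[],
        remediation=["escalate to the owning service's on-call developer"],
    )
    return RUNBOOK.get(alert, fallback)
```

Keeping remediation and diagnostics as separate fields mirrors the two objectives above: the steps that stabilize production come first, and data collection for root cause analysis is tracked on its own.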

Traditionally, organizations had an operations team that did things by hand, and the runbooks described here were mostly lists of manual commands to run. However, with increased scale, complexity, and feature velocity, it became clear that such a process does not scale. Many organizations are moving to a Site Reliability Engineering (SRE) team, a model that was first set up by Google. An SRE team is a team of engineers who use software tooling to manage all the software and infrastructure of the application. Effectively, the runbooks are automated, so that actions previously done by hand happen automatically.
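To make the idea of an automated runbook concrete, the sketch below maps an alert name to an ordered list of actions that run without a human typing commands. Everything here is an assumption for illustration: the action functions are placeholders that only return a log line, and the alert name, the instance name, and `handle_alert` are hypothetical. Real tooling would call the load balancer, the orchestrator, and the paging system instead.

```python
from typing import Callable, Dict, List

# An action takes the name of the affected instance and returns a log line.
Action = Callable[[str], str]


def remove_from_rotation(instance: str) -> str:
    # Placeholder: real tooling would call the load balancer's API here.
    return f"removed {instance} from rotation"


def capture_diagnostics(instance: str) -> str:
    # Placeholder: real tooling would archive logs, dumps, and metrics.
    return f"captured diagnostics from {instance}"


def restart_service(instance: str) -> str:
    # Placeholder: real tooling would ask the orchestrator to restart.
    return f"restarted service on {instance}"


# Stabilize production first, then collect data from the isolated instance.
AUTOMATED_RUNBOOK: Dict[str, List[Action]] = {
    "instance_unresponsive": [
        remove_from_rotation,
        capture_diagnostics,
        restart_service,
    ],
}


def handle_alert(alert: str, instance: str) -> List[str]:
    """Run the automated steps for an alert, or fall back to paging a human."""
    actions = AUTOMATED_RUNBOOK.get(alert)
    if actions is None:
        return [f"no automation for {alert}; paging on-call"]
    return [action(instance) for action in actions]


if __name__ == "__main__":
    for line in handle_alert("instance_unresponsive", "web-1"):
        print(line)
```

The ordering of the actions is the design choice worth noting: the instance leaves rotation before any diagnostics are captured, so production is stable while the data for root cause analysis is gathered.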

The SRE team is generally complemented by a set of "on-call" developers from the individual service teams. These on-call developers are responsible for their services in production, and handle the detailed debugging and L2 support. They work very closely with the SRE team during production incidents.

While we have given a quick introduction to how a DevOps team can be set up, there are still a lot of details to work out. These details are, however, out of scope for this book.
