Blameless postmortems

A topic we have touched on already is the key adjective usually attached to postmortems: blameless. Postmortems used improperly can create dangerous, toxic cultures in which people are afraid of outages or feel incapable of doing their jobs because of the fear of failure. Start with a fact: we all fail. As humans, failure is one of our most defining traits. One of the key tenets of SRE is that if you are going to fail, you should set up your organization so that failure is not a big deal and you can quickly recover and learn from it. Another tenet is that reliability is a team sport. Everyone needs to feel comfortable raising issues, responding to outages, and working on improving the system. Fostering that level of psychological safety is difficult and needs constant reinforcement.

Note that, in the Root cause section, we did not point out the person who made the change that broke the system, nor the person who wrote the code that couldn't handle the user interaction we encountered. On top of that, no one got fired, no one was yelled at, and no one was blamed.

We need our employees to feel comfortable taking risks, and equally comfortable fixing things when they fail and talking about why they failed. This sounds simple, but when an outage causes major damage to an organization or business, it is not trivial. If you are a manager, you may be put in the position of needing to explain why you shouldn't fire a person. You may even be asked to provide a sacrificial lamb. I used to think that this situation only happened in movies, but an organization often wants to be able to blame a problem on a single person. Stay in the industry long enough and you will hear stories of people doing all sorts of things so that they can keep their jobs and claim to a customer that something will never happen again.

If you're having trouble not blaming a person, I have some tips. First, go through your document and replace all names with "we," then walk through the timeline. Instead of blaming the human, figure out every step where a tool could have prevented a human from making a mistake. Could automation save a person from skipping a step in a checklist? Is there a checklist? It is well known that checklists save lives; medical and aeronautical professionals discovered this years ago (see https://ti.arc.nasa.gov/m/profile/adegani/Cockpit%20Checklists.pdf). As computer programmers, though, we can often go a step further and turn processes into code. The reason we do this is that there will come a day when someone doesn't even know the checklist exists. So, instead of requiring the human to do something, figure out where the human shouldn't have had to do anything at all.
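To make that concrete, here is a minimal sketch, in Go, of what turning a checklist into code might look like. The step names and checks are hypothetical placeholders, not steps from any real runbook; the point is only that the order is enforced by the program rather than by human memory.

package main

import (
	"fmt"
	"log"
)

// step is one item from a checklist, turned into code so a human
// can't skip it or forget that it exists.
type step struct {
	name string
	run  func() error
}

// The steps below are hypothetical placeholders; substitute the
// checks your own runbook actually requires.
var checklist = []step{
	{"verify the on-call engineer has been paged", func() error { return nil }},
	{"confirm the error budget is not exhausted", func() error { return nil }},
	{"snapshot the current configuration", func() error { return nil }},
	{"canary the new version to a small slice of traffic", func() error { return nil }},
}

func main() {
	for i, s := range checklist {
		log.Printf("step %d/%d: %s", i+1, len(checklist), s.name)
		if err := s.run(); err != nil {
			// Stop at the first failure instead of letting a human
			// decide whether the step "probably" passed.
			log.Fatalf("checklist failed at %q: %v", s.name, err)
		}
	}
	fmt.Println("all checklist steps passed")
}

Once the checklist lives in code, it can be versioned, reviewed, and wired into tooling, so nobody has to remember that it exists.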

Note

The importance of the checklist is often traced to the 1935 crash of the Boeing Model 299, the prototype of the B-17, in which the crew forgot essential steps while preparing the plane for takeoff. After a takeoff checklist was introduced, B-17 crews flew 1.8 million miles without another serious incident. Robert L. Helmreich, Ph.D., is often credited with the continued reverence for checklists; in the 1990s he wrote extensively about how they reduce accidents in hospitals and aviation.

Next, ask whether your service had trouble dealing with a dependency failing. Cascading failures are one of the most dangerous types of failure in service-oriented architectures. You can often recover from one service failing, but recovering from all of your services failing is much more difficult. One example was a chat service that Google ran. It had a bug in its retry logic: if the mobile service was down for long enough, the retry polling became synchronized. All of a sudden, there were millions of devices polling the servers at the same time. The increased traffic caused other services to go down as the ingress network became overwhelmed. The outages continued until engineers were able to shed enough load to deal with the increased traffic. There were lots of issues with the mobile app, the chat service backend, and the generic frontends, which weren't ready for that kind of traffic pattern. Any one of these teams could have pointed the blame at the others.

I should mention that I was not on any of these teams; this is just one of those stories that gets passed down. The goal is that every piece involved in the outage or failure should change its behavior. At Google, the frontends needed a way to shed traffic faster, the backends needed better testing to keep code bugs out of production, the mobile clients needed to randomize their connection retries, and some services needed to remove dependencies they didn't even know they had.
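As an illustration of the client-side fix, here is a minimal sketch of randomized exponential backoff in Go. This is not Google's actual client code; the poll function and the delay values are made up, but the pattern, doubling the delay after each failure and adding random jitter, is what keeps millions of clients from retrying at the same instant.

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op up to attempts times. After each failure
// it sleeps for a random duration between half the current delay and
// the full delay, then doubles the delay, capped at max. The jitter
// spreads clients out so they don't all reconnect at the same moment.
func retryWithJitter(attempts int, base, max time.Duration, op func() error) error {
	delay := base
	for i := 0; i < attempts; i++ {
		if err := op(); err == nil {
			return nil
		}
		sleep := delay/2 + time.Duration(rand.Int63n(int64(delay/2)+1))
		time.Sleep(sleep)
		if delay *= 2; delay > max {
			delay = max
		}
	}
	return errors.New("all retry attempts failed")
}

func main() {
	// poll is a stand-in for the real polling call; it always fails
	// here so the backoff can be observed.
	poll := func() error { return errors.New("service unavailable") }
	if err := retryWithJitter(4, 100*time.Millisecond, 2*time.Second, poll); err != nil {
		fmt.Println(err)
	}
}

With jitter, retries that begin at the same moment spread out over time instead of arriving as a synchronized wave of traffic.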

The point of all of this, though, is that people need to feel safe to raise issues. People in the organization, no matter how large it is, need to work together to fix issues. Everyone needs to feel safe to fail and safe to discuss their failures with others. If people don't feel safe, they will often not be honest or may try to hide things. Even worse, they may simply leave.
