Why write a postmortem?

In the previous chapter, we talked about incident response. We mentioned that when responding to an incident, you need to focus on bringing the system back to a healthy state as quickly as possible. This urgency often prevents you from finding the root cause of the incident. Writing a postmortem document is the right time to figure out what happened. How did the process die? What part of the system caused instability? How long after the incident began did we notice it? Why did other systems fail?

We carry out a postmortem separately from the initial incident so that we can be thorough and meticulous. We must make sure that we have all of the data and that we fix the issue entirely. Often, during an incident, adrenaline is flowing and quick gut decisions are made. This is because there is very little time to think and weigh decisions. If we do the analysis and research afterwards, we can talk to more people, the stress of the outage is not upon us, and we have more time to make a calculated decision. From this analysis, we can decide to create a document to summarize our findings and share the incident with our teammates. To write a postmortem, analysis is required no matter what. You need to know why something failed. Choosing to create a document is optional and we will cover that decision next, but a report will be useless if you do not analyze the incident and figure out what happened.

A postmortem is also a historical document. It is very likely that you will not work for the same organization forever, and that there will be people in your organization who were not involved in the incident or in the service that broke. A postmortem document lets you record what happened, how you resolved the issue, and what your team learned. You can then share the document and archive it so that people in your organization can reference it and learn what actions your team took to respond to this sort of incident. This sharing will not be as useful if your organization does not have a transparent culture. Personally, I learn most by trying things and reading, so I find the act of creating a postmortem very useful. However, a document will not help an organization if teams cannot share what they work on with each other. I have never worked for an organization that restricts sharing in this way, but such organizations definitely exist. A culture of transparency also tends to promote trust. If you are able to see the work others are doing (and how they handle emergencies), you're more likely to trust them to do the right thing and to value their advice in the future.

As already mentioned, people often use postmortems for future reference, for example to help them decide on future priorities or plans. By having an archive, you can see whether a particular type of problem happens across multiple pieces of software that your organization runs and decide to work on a unified fix. It is said that you should automate yourself out of a job. Postmortems do something similar at an organizational level: they remove the burden of having to remember the precise details of what went wrong, so you can move on to other things while still improving organizational knowledge.
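As a rough illustration of how an archive can surface problems that recur across services, the following sketch groups postmortem records by cause category and reports the categories seen in more than one service. The Postmortem record, its field names, the recurring_categories function, and the sample entries are all hypothetical and invented for this example; a real archive would have its own structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Postmortem:
    # Hypothetical record shape; a real archive will have its own fields.
    service: str
    category: str
    summary: str


def recurring_categories(archive):
    """Map each cause category seen in more than one service to those services."""
    services_by_category = defaultdict(set)
    for pm in archive:
        services_by_category[pm.category].add(pm.service)
    # Keep only categories that affected at least two distinct services.
    return {
        category: sorted(services)
        for category, services in services_by_category.items()
        if len(services) > 1
    }


# Made-up sample data, for illustration only.
archive = [
    Postmortem("billing", "expired-certificate", "TLS certificate lapsed"),
    Postmortem("email", "expired-certificate", "Mail relay certificate lapsed"),
    Postmortem("search", "disk-full", "Index volume ran out of space"),
    Postmortem("billing", "expired-certificate", "Second lapse on a new host"),
]

for category, services in recurring_categories(archive).items():
    print(f"{category}: {', '.join(services)}")
```

A category that shows up in several services, like the certificate example here, is a candidate for the kind of unified fix described above.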

Outside of postmortems about incidents, you can also write documents about events. These tend to be a little more freeform but provide the same benefit. An event can mean something like a product launch, a significant new feature, a complex integration, a change of vendors, or something similar. Documenting a significant business event lets you point to past events and reference them when making future decisions. This documentation can keep future teams from repeating an implementation approach or product strategy that has already been tried. It can also help you plan for similar events based on how your team dealt with past ones.

You can also write a postmortem for the public. Often, these documents are for internal usage, but sometimes companies will publish a version of the postmortem for the public or for customers to promote trust. If one of your vendors explains why it couldn't deliver your emails or process your requests, you are more likely to trust it in the long run than if it has an outage and does not acknowledge that it happened. This transparency is a mutual thing. You want your dependencies to explain why problems arise and how they are fixed. These outages and their responses go into your evaluation of the dependency and whether you should keep using it or replace it. The same is true for your services. Someone is using them and sometimes that person is technical and other times they are not.

Assuring your users that you are doing your job and that they should not expect this type of outage again will improve their trust in you and will reduce a customer's hesitation about using or paying for your service. In my opinion, the more transparent the company, the better. That being said, sometimes you cannot share everything because of legal or financial restrictions.

Besides trust building, sometimes there is a responsibility to publish a postmortem because a service broke a published SLA. A classic example of this was when Amazon, which traditionally does not publish detailed postmortems, wrote one when it had a very large S3 outage in 2017 (https://aws.amazon.com/message/41926/). Outside of public and private documents, there are also documents that are shared between businesses that have service contracts or relationships. Sometimes an organization does not want to publish its postmortem to the public, but it is willing to send you a detailed document because you are an important customer.
