Recovering the system

One thing that we have yet to touch on is how to debug your application and do the technical work of responding to an incident. This is the third of the three pillars we mentioned when defining incident response. We were alerted that things were not great. We communicated that we were on the case. Now we need to make things better.

How do we do that? We will be talking about measuring mean time to recovery (MTTR) in Chapter 4, Postmortems, but the strategy we kept mentioning earlier in this chapter was bringing the system back to a working state. That's because you don't necessarily want to go into bug-hunting mode right away. Instead, you want to find what has changed in the system and revert that change. Let us walk through the common first steps in trying to track down a broken system.

Step zero is to take a deep breath. Force yourself to slow down a little. I prefer to count to six while inhaling, count to six again while holding my breath, count to six again while exhaling, and finally count to six before beginning this loop again. Doing this just a few times causes me to relax a little. I am still freaking out because I just got paged and everything is broken, and I do not know what is going on, but breathing slowly forces me to check myself before I wreck myself, as they say.

Note

This method of breathing is called box breathing. Apparently, it comes from the US Navy SEALs. How you calm down doesn't matter; the key is just to find a way to calm yourself a bit so you can think clearly.

Once I have calmed down, the first thing I always check is whether there was a deploy recently. A graph that I have relied on in past jobs shows the percentage of servers running each version of our application. With that, you can watch the number of servers running the old version decrease and the number running the new version increase.

If a new version appears on the graph, then code was probably deployed recently. You can then try to correlate whether the errors you are seeing line up with the appearance of the new version. If they do, then you should revert the deployment and roll back to the previous version.

Note

We talk about deployments, rollbacks, and related concepts in Chapter 5, Testing and Releasing.
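As a rough illustration, here is a minimal Python sketch of the calculation behind that kind of version graph. How you collect one version string per server is up to your own inventory or monitoring system; the function name and example values here are made up for illustration.

```python
from collections import Counter

def version_percentages(server_versions):
    """Given one reported version string per server, return the
    percentage of the fleet running each version."""
    counts = Counter(server_versions)
    total = len(server_versions)
    return {version: 100 * count / total for version, count in counts.items()}

# Example: three of four servers are still on 1.4.2, one has picked up 1.5.0.
print(version_percentages(["1.4.2", "1.4.2", "1.5.0", "1.4.2"]))
# {'1.4.2': 75.0, '1.5.0': 25.0}
```

Plotted over time, those percentages give you the crossing lines described above, which makes it easy to see whether the errors started right as the new version rolled out.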

If the application didn't change, or if the rollback did not fix the issue, the next area to look at is whether our environment has changed. Have our inputs changed? Have we started receiving a large amount of traffic? Are we being sent a bunch of traffic that is nonsense or corrupted? Do we need to start blocking a certain IP address or user-agent while we figure out how to change our application to handle this type of bad traffic better? Is this valid traffic? If so, can we increase the number of instances we have running, so that we have more resources to deal with this increase in traffic?

Note

Dropping, or refusing to respond to, traffic to a web server that matches some sort of filter is called load shedding. It is often viewed as an aggressive tactic, but it is very useful when a certain category of traffic is causing your application to fall over.
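To make that concrete, here is a minimal sketch of load shedding implemented as a Python WSGI middleware. The blocked IP address and user-agent string are placeholders, not real values, and in practice you would more likely shed load at the load balancer or reverse proxy than inside the application itself.

```python
# Minimal load-shedding sketch: requests whose client IP or User-Agent
# matches the block lists get an immediate 503 instead of ever reaching
# the (already struggling) application.
BLOCKED_IPS = {"203.0.113.7"}            # placeholder address
BLOCKED_AGENT_SUBSTRINGS = ("BadBot",)   # placeholder user-agent fragment

class ShedLoad:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        agent = environ.get("HTTP_USER_AGENT", "")
        if ip in BLOCKED_IPS or any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain")])
            return [b"Shedding load, please retry later\n"]
        return self.app(environ, start_response)
```

Wrapping your WSGI application in ShedLoad(app) is enough to turn the filter on, and removing the wrapper turns it back off once the incident is over.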

If user inputs have not changed, then maybe one of our dependencies has changed. Is our database up? Are the services that we depend on running? Have they changed, without telling us, in a way that our application is not compatible with?
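When I hit this step, a quick connectivity sweep over the obvious dependencies often saves time. Below is a minimal Python sketch using only the standard library; the hostnames and ports are placeholders for whatever your application actually depends on, and a successful TCP connection only tells you the dependency is reachable, not that it is healthy.

```python
import socket

# Placeholder dependencies: replace with the hosts and ports your
# application actually talks to.
DEPENDENCIES = {
    "database": ("db.internal.example.com", 5432),
    "cache": ("cache.internal.example.com", 6379),
    "payments-api": ("payments.example.com", 443),
}

def check_dependencies(deps, timeout=2.0):
    """Try to open a TCP connection to each dependency and report the result."""
    for name, (host, port) in deps.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                print(f"{name}: reachable")
        except OSError as exc:
            print(f"{name}: UNREACHABLE ({exc})")

check_dependencies(DEPENDENCIES)
```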

Often, external dependencies will publish public status pages, or post on Twitter, announcing that they are having an outage. You can also email their support address, or message them on Twitter, if you have reason to believe they are down but they have not published an outage on their status page (or they do not have a status page).

Very rarely does this list of areas to check (code change, input change, and dependency change) fail to turn up something that you can quickly change to bring the system back to a healthy state. In cases where you cannot find anything, it is a great time to call in other coworkers and start digging deeper. Start checking everything: find graphs that look unusual, dig through the logs, and explore weird theories. Remember, this is not meant to be a permanent fix but more of a way to bring the system back to a place where most customers can use it. Then you can let your customers keep using your service while you work with your team to come up with a longer-term plan to prevent this from happening again.
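If I do end up digging through the logs at this point, the first question I usually want answered is when the errors actually started. Here is a minimal Python sketch that buckets error lines by minute; it assumes log lines start with an ISO-8601-style timestamp and contain the word ERROR, so adjust it to your own log format.

```python
import re
from collections import Counter

# Matches a leading timestamp down to the minute, e.g. "2023-04-01T12:34".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")

def errors_per_minute(path):
    """Count lines containing ERROR, bucketed by the minute they were logged."""
    buckets = Counter()
    with open(path) as log:
        for line in log:
            if "ERROR" in line and (match := TIMESTAMP.match(line)):
                buckets[match.group(1)] += 1
    return buckets

# "app.log" is a placeholder path; point this at your real application log.
for minute, count in sorted(errors_per_minute("app.log").items()):
    print(minute, count)
```

Even a crude per-minute count like this can point you at the moment things went sideways, which narrows down which graphs and changes to look at next.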
