Being on call

On-call life can be very stressful. The stress comes from not knowing what is going to happen, plus not being able to work on the projects that you actually enjoy. To make it a better experience for those involved, the following are some tips and general policies for keeping everyone happy:

  • When deciding who to alert, do not alert everyone unless you have fewer than three people to alert. If you alert everyone, then you are alerting no one. I say fewer than three because with one or two people, in a rotation that has a primary and a secondary, you both are always on call anyway. You can do a rotation, but it might be just as easy to alert you both. Once you have three people, one person can have time off.
  • Make sure there is an on-call schedule and people know when they are the on-call person. People should know how long they will be on call and when they next need to be on call. Usually, it helps to have the next three months of on call scheduled, so that people know if they need to be available and they can schedule life and work time.

    Not having on-call scheduling ahead of time can be frustrating for employees and tends to mean that people take fewer vacations because they don't know when they will be needed at work.

  • Provide people with a backup when they are on call. As humans, we have lives outside of work. If an on-call person has a backup, then they can feel less stress about spending an hour on the subway without cell service or being out of reach while they see a movie. It is completely possible to have a single person on call for a service, but having a backup provides that person with support and makes sure they aren't alone.
  • Create an explicit escalation policy for on-call folks (and alerting systems) to follow. If there is only one team, that is usually the on-call person, their backup, and then the rest of the team. If there are multiple sites (for example, one in Asia and one in the Americas), then often, after the backup, the on-call person at the other site is alerted before everyone else. This is important so that if a person is not available, the alert system has someone else to contact. For example, I take the subway to work and it doesn't have Wi-Fi or cell phone reception. During my commute, I am not contactable. As such, if an alert fires during this time, the system won't be able to reach me and instead it will try to reach my on-call partner. If they are asleep or also on the subway, it alerts the rest of the team.
  • Escalation policies for other teams should also be visible. This isn't that valuable for very small companies, but once you have multiple engineering teams, making it easy for one team to send an alert to another team and land on the correct person is incredibly important. A team's designated on-call person is aware of the current state of their software and is set up to be responsive during that time period.

    If you were to message a random person on a team, instead of their dedicated on-call person, you might not get an answer or you might get an incorrect one. If someone reaches out to you and you are not the on-call person for your team, you should redirect them to the correct person and also show them where to find that information, so that next time they escalate to the right person.

  • Define what a person's responsibilities are while on call. Are they expected to still do normal work? Do they need to travel with their laptop and a wireless internet connection? What is an acceptable response time to alerts? Do they need to write up any documentation after an alert or incident? Also, make sure the hours and times of responsibilities are addressed. For example, some services have different SLOs during business hours and at night. So maybe an on-call person can mute all alerts that happen between 10pm and 6am, or maybe the allowed response time is longer at night. If there is a team in a different time zone, as mentioned above, the on-call person may need to be less available at night: once someone in the other time zone becomes the primary, they drop from primary to tertiary.
  • Compensate people more for being on call. That is, if possible, pay people extra for being available outside of traditional work hours. My general philosophy is that you should get a 15% pay bump while on call. This number is pretty arbitrary, so it could be anything, but 15% assumes that you're working six extra hours on top of the normal 40 hours per week in the US (6 / 40 = 15%). This pay increase is not always feasible, and I have seen managers find all sorts of ways to compensate employees for their time on call. One friend gets a day off for every ten days he is on call. Another friend gets extra stock options. A third friend, at a very small start-up, just gets taken out for a very nice dinner once a month because she is always on call.
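
The escalation policy described in the list above can be written down as an ordered list of steps that the alerting system walks through until someone responds. The following is only a minimal sketch of that idea, not how any particular paging product works. The names EscalationStep, escalate, and try_page are hypothetical, and try_page stands in for whatever call your alerting tool makes to page a person and wait for an acknowledgment. For simplicity it pages people one at a time, whereas a real system would usually page a whole step at once and wait for a timeout.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EscalationStep:
    """One rung of a (hypothetical) escalation policy."""
    label: str           # for example, "primary on-call" or "backup"
    contacts: List[str]  # people to page at this step


def escalate(steps: List[EscalationStep],
             try_page: Callable[[str], bool]) -> Optional[str]:
    """Walk the steps in order and return the first person who acknowledges.

    try_page is an assumed callback: it pages one person and returns True
    if they acknowledged the alert, or False if they could not be reached.
    """
    for step in steps:
        for contact in step.contacts:
            if try_page(contact):
                return contact
    return None  # nobody responded; the policy is exhausted


# Single-team policy: the on-call person, their backup, then everyone else.
policy = [
    EscalationStep("primary on-call", ["me"]),
    EscalationStep("backup", ["partner"]),
    EscalationStep("rest of the team", ["teammate_a", "teammate_b"]),
]

# Simulate my commute: I am on the subway and cannot be reached.
on_the_subway = {"me"}
responder = escalate(policy, lambda person: person not in on_the_subway)
print(responder)  # prints "partner"
```

With multiple sites, the on-call person at the other site would simply be another step placed after the backup and before the rest of the team.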

Adding new people to a rotation takes time. Make sure that you provide people with support as they are ramping up to being on call. Let someone shadow a primary on-call person to see what receiving an alert is like, without having to do the response work. Having a seasoned secondary around when someone has their first primary shift is great; just make sure that the seasoned person doesn't try to do things for the new primary person unless they ask for help. Often, a more seasoned developer takes over while the newer person is trying to learn, because they get frustrated with the newer person going slowly or doing things slightly differently. You can also use tools like Wheel of Misfortune and DiRT to train new people before they join a rotation. We talk about these practices in Chapter 5, Testing and Releasing.
