amcneil36

Oncall

Motivation

It’s 9am on a Saturday and your app suddenly stops working. Users trying to open the app just see a blank screen with a spinning icon. People are taking to Twitter to complain about your app. An hour later, TMZ’s front page has an article about hackers stealing user data by the second. Your company’s stock price is down 12% after hours.

But your employees only work Monday through Friday, 9am to 5pm. So you start working on the fix two days later, after 5 million more users’ data has been stolen, 6 more TMZ articles have been written, and the stock price has dropped another 30%. Furthermore, there is now a lawsuit over how slowly your company responded to the hack.

Okay. So this was an extreme example that isn’t representative of how a company would typically react to this situation. Normally, when something really catastrophic is happening, a company is going to try to fix the issue as soon as possible, even if that means working outside of the normal work hours. But if the issue is really minor, it’s not going to be worked on outside of the normal work hours. In an ideal world, nothing would ever go wrong and no employee would have to work outside of their regular work hours. But, unfortunately, sometimes something catastrophic can happen and we need someone to be available to help with the fix. It’s sort of like when your power goes out or when a fire starts near your residence. For scenarios where something catastrophic is happening, you really need someone to be there who can do what is needed to fix your issue ASAP.

Oncall overview

This is where the oncall comes in. Now, there’s not typically going to be one oncall for the entire company, unless it’s a tiny company. There are typically separate oncalls for each area, so the type of issue determines which person is responsible for fixing it. There should be documentation for which team/department owns which component and which oncall rotation serves that component. There might be a triager who reads through the documentation and uses it to determine which oncall to contact when something has gone wrong. Or, even better, there may be alerts in place that notify the appropriate oncall person whenever something has gone wrong. For example, there may be an alert configured to ping the login oncall whenever the number of login failures is higher than expected. This way, the correct person is notified right away without having to wait for a user to file a bug report.
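To make the alert idea a bit more concrete, here is a minimal sketch of a check that pages the owning rotation when failures exceed an expected ceiling. Everything named here is made up for illustration: the `COMPONENT_OWNERS` table, the rotation names, the `Page` type and the `check_failures` function are assumptions, not any real alerting tool’s API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership table: component name -> oncall rotation that serves it.
# In practice this would come from the team's ownership documentation.
COMPONENT_OWNERS = {
    "login": "login-oncall",
    "payments": "payments-oncall",
}


@dataclass
class Page:
    rotation: str  # which oncall rotation to notify
    message: str   # what the oncall sees


def check_failures(component: str, failures: int, expected_max: int) -> Optional[Page]:
    """Return a Page for the owning rotation if failures exceed expected_max, else None."""
    if failures <= expected_max:
        return None
    rotation = COMPONENT_OWNERS[component]
    return Page(rotation, f"{component}: {failures} failures (expected at most {expected_max})")
```

A real setup would send the page through whatever paging system the company uses. The point of the sketch is only that the component-to-rotation mapping lets the alert reach the right person without a human triager in the middle.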

Should the oncall primarily fix issues or triage issues to someone else?

Should the oncall try to fix all of the issues that come in, or should the oncall just be a triager who re-routes them to someone who is not currently oncall but might have more knowledge? It’s usually a little of both, but it depends on things like whether the issue is discovered during or outside of work hours, the severity of the issue, how much knowledge the oncall has compared with others who are not oncall, and how much bandwidth the oncall has compared with others.

If it’s a severe issue and is discovered outside of work hours, usually the oncall would work on the issue because people who aren’t oncall are not expected to be available outside of work hours. Whether an oncall even works on an issue outside of work hours depends on the severity of the issue. If it’s a severe issue and is discovered during work hours, the issue should be triaged to whoever is thought to be able to resolve it the quickest, irrespective of whether they are oncall. If the issue isn’t severe and the oncall is only slightly less knowledgeable about the issue than someone who is not oncall, then I generally prefer the oncall person to work through the issue.
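The rules above can be written down as a small decision function. This is just a sketch of my own preferences, not a standard policy. The `Action` names and the `decide` function are hypothetical, and one branch (a non-severe issue during work hours where someone else knows far more than the oncall) is not covered by the rules above, so the choice there is an assumption and is marked as such in the code.

```python
from enum import Enum


class Action(Enum):
    ONCALL_WORKS_IT = "oncall works the issue"
    ROUTE_TO_FASTEST = "route to whoever can resolve it quickest"
    DEFER = "leave it for normal work hours"


def decide(severe: bool, during_work_hours: bool,
           oncall_nearly_as_knowledgeable: bool = True) -> Action:
    """Pick who handles an incoming issue, following the rules described in the text."""
    if severe:
        if during_work_hours:
            # Everyone is available, so send it to the fastest resolver.
            return Action.ROUTE_TO_FASTEST
        # Outside work hours only the oncall is expected to be available.
        return Action.ONCALL_WORKS_IT
    if not during_work_hours:
        # Minor issues are not worked on outside of normal work hours.
        return Action.DEFER
    if oncall_nearly_as_knowledgeable:
        return Action.ONCALL_WORKS_IT
    # Assumption (not stated in the text): with a large knowledge gap,
    # hand the issue to the more knowledgeable person.
    return Action.ROUTE_TO_FASTEST
```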

Depending on how hectic the oncall shift is, tickets can come in faster than the oncall can triage or fix them. If this keeps up, a really severe issue could come in and sit untriaged or unfixed because the oncall has no bandwidth left. It is important that every ticket that comes in gets at least some initial thought right away to determine its severity, so that it can be prioritized appropriately. So if the oncall is having a rough day and tickets are coming in faster than they can be triaged, the oncall should split the tickets up among teammates who are not oncall, and those teammates can help triage or fix the issues. Ideally, though, the team would hold retros to discuss and improve oncall health so that this doesn’t happen.

Note that not all oncall tickets that come in need to be fixed ASAP. Some issues are false positives. Other issues are true positives but not very severe. If a ticket is deemed very low priority, it is okay to put it in the backlog and get to it another day, week, or month, depending on the severity.

How long should an oncall shift last?

Some teams will have oncall shifts last a few weeks or a few months at a time. Others will have oncall shifts that last one week at a time. I have always liked one-week oncalls the best since that is typically around the duration of a vacation, so people can request to swap entire shifts as needed. With longer shifts, you’d typically need to swap partial shifts. The one-week oncall also means I never get too out of the loop on what else is going on.
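Here is a rough sketch of what a one-week rotation with whole-shift swaps could look like. The function names `weekly_rotation` and `swap_shifts`, the people’s names and the dates are all made up for illustration. Real scheduling tools handle this for you.

```python
from datetime import date, timedelta


def weekly_rotation(people, start, weeks):
    """Return a list of (shift_start, person) pairs, one entry per week."""
    return [(start + timedelta(weeks=i), people[i % len(people)])
            for i in range(weeks)]


def swap_shifts(schedule, i, j):
    """Return a new schedule with the people on shifts i and j exchanged."""
    swapped = list(schedule)
    (start_i, person_i), (start_j, person_j) = swapped[i], swapped[j]
    swapped[i] = (start_i, person_j)
    swapped[j] = (start_j, person_i)
    return swapped


# Example: Ana is on vacation during the first week, so she swaps with Ben.
schedule = weekly_rotation(["ana", "ben", "cy"], date(2024, 1, 1), weeks=4)
schedule_after_swap = swap_shifts(schedule, 0, 1)
```

Because each shift is a single week, a swap is just an exchange of two whole entries. With multi-week shifts, covering a one-week vacation would mean splitting an entry first.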

Does the oncall only work on bugs?

Some teams will have their oncall only work on issues that come in during oncall or bugs from the backlog (if there are no active issues from the current oncall). In this scenario, the oncall never spends any time doing enhancements. I don’t personally like the idea of an oncall person being required to work only on bugs for the duration of the oncall shift. If the oncall is blowing up with tons of issues, those issues may consume all of the oncall’s bandwidth, and that’s all the oncall works on. That’s fine. But if the oncall situation is looking calm, the oncall doesn’t have to work on a bug from the backlog just because they are oncall. The product owner should have a priority for the enhancements and bugs. If an enhancement is deemed higher priority than a bug in the backlog, then the oncall should work on the enhancement. Otherwise, the oncall should work on a bug from the backlog. By that same logic, someone who is not oncall ought to consider working on a bug from the backlog too. The oncall should be more for new issues that come up during the oncall shift and not so much for tackling all of the old issues that came up during previous oncall shifts.
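The way I think about picking the next thing to work on can be sketched like this. The `WorkItem` type, the `next_item` function and the convention that a lower number means higher priority are all assumptions for the example. The only idea taken from the text is that active oncall issues come first, and otherwise the product owner’s priority decides, regardless of whether the item is a bug or an enhancement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkItem:
    title: str
    kind: str      # "bug" or "enhancement"
    priority: int  # assumption: lower number = higher priority, set by the product owner


def next_item(active_oncall_issues: List[WorkItem],
              backlog: List[WorkItem]) -> Optional[WorkItem]:
    """Pick the next item: active oncall issues first, then the backlog by priority."""
    if active_oncall_issues:
        return min(active_oncall_issues, key=lambda item: item.priority)
    if backlog:
        # Bugs and enhancements compete on priority alone; being oncall
        # does not force a bug to the front.
        return min(backlog, key=lambda item: item.priority)
    return None
```

Note that the same function works for someone who is not oncall: they would just pass an empty list of active oncall issues.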