Last week there was an interesting conversation on Twitter about “on call.”
This is all part of the ongoing transition to DevOps, which requires tearing down the wall that has historically existed between developers (people who build things) and operators (people who keep things running). A big part of the cultural difference between the two is that development is a more bounded job: you code, and then you’re done and you can relax until it’s time to code again. Operations is more of a 24×7 thing: if the website goes down at 3am, someone is going to get paged to fix it right then.
As DevOps, site reliability engineering (SRE), and related concepts start to break down those clear distinctions, developers are being required to “pick up the pager” and be available on call. As usual, though, it’s not quite that simple. Developers often react badly to being asked to take part in Ops tasks.
Charity Majors, who has been one of the leading voices in this conversation, shared this illuminating tweet:
All this heated talk about on call is certainly revealing the particular pathologies of where those engineers work. Listen:
1) engineering is about building *and maintaining* services
2) on call should not be life-impacting
3) services are *better* when feedback loops are short
— Charity Majors (@mipsytipsy) February 10, 2018
I could not agree more. The way software architectures are structured today, and the speed at which they evolve, mean that we cannot rely on long release cycles that include documentation and troubleshooting guides for use by Ops teams. The only way to fix problems fast is to put the developers of those services on the front lines of responding to them.
There are human and process aspects to this transition, as Charity went on to note:
and tbh your tools suck.
i've been on teams that were sucked dry by horrible on call experiences. I've also turned them around. I've had people *ask* to be on my on call rotations because they learned so much and were fun and enjoyable. 😛
aim higher. don't settle.
— Charity Majors (@mipsytipsy) February 10, 2018
It’s always healthy when a discussion among technologists does not immediately move on to offering one-size-fits-all solutions. Cindy Sridharan followed up some of the Twitter discussion in an instructive blog post with the descriptive and hopeful title of “On-call doesn’t have to suck.” I highly recommend reading the whole thing, but this list of reasons why being on-call has often been a bad experience for people really resonated with me:
On-call can “suck” for a plethora of reasons (and when these repeatedly occur during multiple on-call rotations without being prioritized to be resolved, burnout is inevitable):
- noisy alerts
- alerts that aren’t actionable
- alerts missing crucial bits of information
- outages caused due to bad deploys owing to bad release engineering practices and risk management
- lack of visibility into the behavior of services during the time of an outage or alert that makes debugging difficult
- outages that could’ve been prevented with better monitoring
- high profile outages that are due to architectural flaws or single points of failure
- outdated runbooks or non-existent runbooks
- not having a culture of performing blameless postmortems
- lack of accountability
- not following up with the action items of the postmortem
- not being transparent enough internally or externally
Many of these are human and organizational problems; no tool out there can help you with cultural problems such as not performing postmortems, or not acting on the findings when you do. However, other aspects can be addressed or at least mitigated with better tooling.
Let’s take a step back to examine some key assumptions of IT Operations.
Assumptions Of Ops
You Can’t Just Hire More People
Human expertise is rare and expensive. Even if you could go out and hire another ten experts in your platform tomorrow, it would still take time to get them up to speed on your own processes, help them find the coffee machine, walk them through the code repositories and point out all the idiosyncrasies, and so on. Call it six months or so before they are up to speed, and allow for a few to wash out of that process along the way (they get a better offer, something unexpected happens in their personal life, or it’s simply not a good cultural fit).
People Can’t Just Work Harder
Once you have the people, you can only usefully work them so hard. Sure, there’s crunch time, but you can’t operate in crunch mode for months at a time without burning people out — and then you’re back to hiring, but with the disadvantage that your current team is demoralized and perhaps also further under strength due to departures.
Alerts Are Expensive (Somewhere, Somehow)
If you wake up a senior developer at 3am over what turns out to be nothing at all, you just incurred a significant cost. There is of course whatever direct cost your organization associates with pages (at many shops, that’s time-and-a-half or even double-time, with a minimum call-out floor too). On top of that, you have the indirect cost of that person being distracted and less productive due to lack of sleep.
Delegation Is Also Expensive
Okay, so you think you can fix that by only putting junior developers on your on-call rotation. Here's the problem: if you let senior developers opt out, all you achieve is to shift the work around. Somebody has to write the knowledge base or the runbook that the junior will follow when the page comes in, and that somebody has to be a senior developer, because they are the only ones with the knowledge to do so.
Automation Won’t Save You
Automating incident response is a variation on the same problem: someone has to write the script or the automated runbook and, on top of that, figure out the conditions under which that automated action should be triggered. This is one of those Pareto Principle tasks: you can take care of a big chunk of your incident response by automating a few frequently occurring cases, but then there is an enormously long tail of much rarer issues where automation yields rapidly diminishing, or even negative, returns.
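To make that Pareto shape concrete, here is a minimal sketch in Python of what an automated runbook usually boils down to. The alert fields, services, and handler names are all invented for illustration; the point is simply that the automation is a small lookup table of well-understood failure modes, and everything outside it still lands on a human.

```python
# Illustrative sketch only: hypothetical alert shapes and remediation handlers,
# not any particular vendor's implementation.

def restart_web_worker(alert):
    print(f"restarting worker on {alert['host']}")  # stand-in for a real remediation step

def clear_tmp_disk(alert):
    print(f"clearing /tmp on {alert['host']}")

def page_on_call(alert):
    print(f"paging on-call for {alert['service']}: {alert['condition']}")

# The "easy 80%": a handful of failure modes someone has already diagnosed,
# scripted, and judged safe to fix without waking anyone up.
AUTOMATED_RUNBOOK = {
    ("web", "worker_crash"): restart_web_worker,
    ("web", "disk_full_tmp"): clear_tmp_disk,
}

def handle(alert):
    action = AUTOMATED_RUNBOOK.get((alert["service"], alert["condition"]))
    if action:
        action(alert)
    else:
        # The long tail: everything nobody has seen often enough to automate
        # still ends up with a person, usually at 3am.
        page_on_call(alert)

handle({"service": "web", "condition": "worker_crash", "host": "web-03"})
handle({"service": "billing", "condition": "queue_backlog", "host": "mq-01"})
```

Note that every entry in that table had to be written, tested, and maintained by someone who already understood the failure mode, which is exactly the senior-developer time the automation was supposed to save.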
Filtering Is Also a Major Effort
Right, you think, but what if instead you filter alerts so that only good, relevant, and actionable alerts get sent? Actually, that can be even worse. First of all, someone has to build and maintain the filters, and once again that has to be a senior dev with the domain expertise to judge what is worth waking someone up over. That judgment changes and evolves as the application changes, so the filters are never "done"; they are a never-ending maintenance task, the very definition of technical debt. Secondly, organizational biases get built into the filters: either the alerts err on the side of safety and send more, possibly irrelevant, alerts, or they err on the side of letting devs sleep and risk not alerting on real issues.
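Here is a hypothetical hand-rolled filter, again with invented services and thresholds, that shows where that judgment ends up: hard-coded into numbers and defaults that someone has to keep revisiting as the system changes.

```python
# Illustrative sketch only: the services, thresholds, and defaults are invented
# to show how judgment and bias get encoded into an alert filter.

# Every number here is a senior engineer's opinion, frozen at the moment it was
# written. As the application changes, these thresholds silently go stale.
LATENCY_PAGE_THRESHOLD_MS = {
    "checkout": 800,   # "checkout is revenue, page aggressively"
    "search": 2000,    # "search is slow anyway, don't wake anyone"
}

def should_page(alert):
    if alert["type"] == "latency":
        threshold = LATENCY_PAGE_THRESHOLD_MS.get(alert["service"])
        if threshold is None:
            # A new service nobody wrote a rule for: fail open (more noise) or
            # fail closed (missed outages)? Either default is an organizational bias.
            return True
        return alert["p99_ms"] > threshold
    # Anything the filter's author didn't anticipate defaults to paging.
    return True

print(should_page({"type": "latency", "service": "checkout", "p99_ms": 950}))  # True: someone gets woken up
```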
Complexity Will Always Bite You
The final, and most insidious, assumption is so built into the way many organizations think about operations that it is often not even stated explicitly: that each alert should become an incident investigated by one (and only one) team. A slowdown in a user transaction timed by an APM solution gets routed to the application support team, while an overloaded network segment is already being investigated by the netops team. Two different engineers get paged, on opposite sides of town, and each starts investigating their own issue, not realizing that it is only one symptom of a wider incident affecting the service.
AIOps Makes On-Call Better For Humans
Moogsoft is built on an intimate awareness of all of these issues. Our entire platform is designed around the idea of providing useful and actionable notifications in real time to the relevant people across all the different affected teams — and not waking anyone up unnecessarily.
Moogsoft AIOps acts as a filter between event sources and the people who are on call. We use algorithms and data science to identify which events are significant in the first place, whether they come from physical or virtual infrastructure, from a performance metric, or from a log file somewhere. We then use a second set of algorithms to look for correlations between those different data sources, to make sure we are not alerting a bunch of people unnecessarily over what turn out to be symptoms of a different issue. Finally, we give the unlucky person who does get the call all the tools they need to do something about it, directly on the mobile device where they received the notification in the first place.
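As an illustration of the general idea (not a description of Moogsoft's actual algorithms), here is a toy correlation pass that groups alerts arriving close together in time and sharing an attribute such as a network segment. The field names and the five-minute window are assumptions for the example; the point is how many raw alerts collapse into one incident that the affected teams can work on together.

```python
# Toy correlation sketch: group alerts into candidate incidents by
# (shared segment, time bucket). Field names and window size are invented.

from collections import defaultdict

WINDOW_SECONDS = 300  # hypothetical correlation window

def correlate(alerts):
    """Return lists of alerts that likely belong to the same incident."""
    incidents = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        bucket = alert["timestamp"] // WINDOW_SECONDS
        incidents[(alert["segment"], bucket)].append(alert)
    return list(incidents.values())

alerts = [
    {"timestamp": 1000, "segment": "dc1-rack7", "source": "apm", "msg": "checkout latency"},
    {"timestamp": 1060, "segment": "dc1-rack7", "source": "netops", "msg": "link saturation"},
    {"timestamp": 5000, "segment": "dc2-rack3", "source": "logs", "msg": "disk errors"},
]

for incident in correlate(alerts):
    teams = sorted({a["source"] for a in incident})
    print(f"incident with {len(incident)} alert(s), teams involved: {teams}")
```

In this sketch the APM latency alert and the netops saturation alert end up in the same incident, so the two engineers from the earlier example would be looking at one shared problem instead of paging each other's blind spots.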
This is not a magic wand that solves every on-call problem; technology rarely provides that kind of fix. It will, however, go a long way toward putting a fix in place when coupled with the sorts of cultural changes that others have detailed better than I could. In the longer term, using AIOps will also provide the data to drive further changes that align operational processes more closely with the needs of the organization, and of the people who are part of it.