How to Panic in a Controlled Manner - On-call for the Modern Age

Daniel Kapit, 4 min read

A Tale, Far Too Common

You’re sleeping soundly when suddenly, out of the dark, an alarm sounds. You roll over to your blindingly bright phone screen, which blares the time: 2:48 AM. Below the clock, a single notification reveals the source of your awakening—PagerDuty, warning you that there’s been a TypeError: unable to get property 'id' of undefined.

When you open your laptop to acknowledge the alert, you’re given little more information than that obnoxious notification provided. Once more, you must dive into your untyped, vanilla JS codebase to figure out whether you’ve been woken up for a true production issue, or because someone forgot to do a null check.
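That 2 AM TypeError is exactly the class of bug a type checker rules out before deploy. As a minimal sketch (the `User` shape and `getUserId` helper are illustrative assumptions, not taken from the incident above), TypeScript with strict null checks refuses to compile an unchecked property access on a possibly-undefined value:

```typescript
interface User {
  id: string;
}

// `findUser` stands in for whatever lookup returned undefined at 2:48 AM.
function findUser(users: Map<string, User>, name: string): User | undefined {
  return users.get(name);
}

function getUserId(users: Map<string, User>, name: string): string | undefined {
  const user = findUser(users, name);
  // Writing `user.id` here without a check is a compile error
  // ("'user' is possibly 'undefined'"), not a page.
  return user?.id;
}

const users = new Map<string, User>([["ada", { id: "u-1" }]]);
console.log(getUserId(users, "ada"));    // "u-1"
console.log(getUserId(users, "nobody")); // undefined
```

The compiler forces the author to decide, at write time, what a missing user means—instead of letting production decide for them.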

Sadly, this happens far too often in the world of software on-call. Eliminating these unacceptable pages and establishing good alerting patterns are the first steps to creating a maintainable incident response culture.

The Root of the Problem

Bugs are a ground truth of software engineering, and the plethora of techniques for removing them speaks only to how rent-free they live in developers’ minds. So why is it that even with broad adoption of testing methodologies, build pipelines, mocking libraries, type systems, and more, we still see production incidents?

In my experience, the answer to this question lies in what are often the first decisions made by the designers of a new piece of software: the programming language and supporting data model (including choice of database).

Programming Languages as an Incident Mitigation Tool

If you asked any handyman why they use a hammer to drive in a nail, they’d look at you like you’re nuts. Why would you use anything else?

While the details may be far more complex when choosing a programming language for a production system, the idea is the same: you need a tool that appropriately solves the problem at hand, while preventing problems in the future. That second part is left out far too often. Not only will a thoughtful language choice actively prevent the accrual of technical debt, it will help determine how software is designed, laid out, built, deployed—the list goes on. Planning ahead on any one of these details decreases the risk of a production incident down the line.

An example: say you have a team of entry-level engineers—smart folks who just graduated from a school primarily teaching Java—and the first project that will be assigned to this new team is going to be related to streaming critical, user-facing data. In the first design meeting, one of your long-time engineers suggests to these fresh-faced new hires that “hey, you guys should use Clojure! It’s functional, and really good for data streaming.”

While this engineer may be right about those properties of the language, there are a couple of lower-level details that should influence the final decision. Perhaps the most important: Clojure is dynamically typed. By choosing a dynamically typed language, engineers lose the ability to check a subset of code correctness simply by running the language’s default compiler, and instead depend on runtime correctness, which is much more fickle (and much more likely to throw errors in production). Your already Java-familiar engineers would be better off with something like Java, Kotlin, or even Scala (which shares functional properties with Clojure). Not only would this provide them with a strong type system that checks their code at compile time, but they’d be working with languages and tools they already know, giving them a leg up if something does go wrong at 2 AM and they need to ship a hotfix. Weighing all of these questions when choosing a language for new software can significantly decrease the risk of an incident, and can make it much easier to ship a fix when one does arise.

Incident-Proof Data

The second piece of the puzzle is just as fundamental: data.
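One concrete way data-model choices prevent pages is validating records at the system boundary, where the context is known, rather than trusting them downstream. A minimal sketch—the `User` shape and `parseUser` function are illustrative assumptions, not a prescribed library:

```typescript
interface User {
  id: string;
  email: string;
}

// Reject malformed records at ingestion, with a clear error message,
// rather than letting an undefined `id` surface three services later.
function parseUser(raw: unknown): User {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("user record is not an object");
  }
  const rec = raw as Record<string, unknown>;
  if (typeof rec.id !== "string" || rec.id.length === 0) {
    throw new Error("user record is missing a non-empty id");
  }
  if (typeof rec.email !== "string") {
    throw new Error("user record is missing an email");
  }
  return { id: rec.id, email: rec.email };
}
```

The same principle applies one layer down: a `NOT NULL` constraint in the database is the data model’s version of this check.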

Designing to Prevent Incidents

The same way an architect designs a building to withstand strong winds or an earthquake, so must software engineers design their software to withstand system failures, data issues, or raw human mistakes. I’m not going to dive into the complexities of distributed system design and the various types of fault tolerance—there are many academics far smarter than I who’ve written papers on the subject. I’ll simply say this: plan ahead, and consider how, at each step of the design process, you can avoid incidents (preventable or otherwise) or make life easier for the poor sap who has to fix them.
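To make "plan ahead" concrete: one such habit is wrapping flaky calls so a transient failure retries with backoff instead of paging someone. This is a sketch under stated assumptions—`withRetries` is an illustrative helper, not a specific library’s API:

```typescript
// Retry a flaky async operation with exponential backoff before
// giving up and letting the error propagate (and alert).
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Deciding up front which failures are retryable—and which should wake someone up—is exactly the kind of design-time choice that keeps pages meaningful.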

Reacting to Incidents

Even with all of the planning in the world, incidents arise. In the following sections I will outline the key actions to take and plans to make when responding to an incident, but I will not go into detail—that is reserved for future posts. When each is written, I will link it in the related section below.

On-Call Rotations

Incident Communication

Post Mortems

© Daniel Kapit.