The 5 Stages of a Production Incident
Here’s a bit of a paradox: the better you are at solving SaaS production incidents, the harder each incident is to solve.
At first glance, this doesn’t make a lot of sense. Wouldn’t being better make solving production incidents easier? No. The trick is that once you get good at handling production incidents, you don’t get hit with the easy ones anymore: you’ve solved them for good. That leaves only the new and challenging problems for you to solve. The average incident is more complex, but your reward is that the frequency of incidents goes way down.
I’d take that trade any day.
SaaS War Games - Part 3: Running a War Game
Planning is critical to running a successful War Game. One of the core goals is for the incident to feel real, so expect to spend 3-4x as much time planning the War Game as you spend running it.
SaaS War Games - Part 2: War Game Basics
In the first article of this series, we identified a few challenges of production incidents. They’re fast, filled with pressure, and (hopefully) brand new failures. If the best-case scenario is a new failure (remember: a repeated failure means we never solved it the first time), how can we practice?
SaaS War Games - Part 1: Getting Comfortable with Being Uncomfortable
It’s 3 AM and your phone is ringing. There’s only one number you let ring through your Do Not Disturb settings. You open one eye and look at the first of 12 on-call notifications.
“Database down. Need help.”
It’s gonna be a long night.