SaaS War Games - Part 1: Getting Comfortable with Being Uncomfortable

It’s 3 AM and your phone is ringing. There’s only one number you let ring through your Do Not Disturb settings. You open one eye and look at the first of 12 on-call notifications.

Database down. Need help.

It’s gonna be a long night.

If you’ve ever been part of a SaaS operations team, you’ve probably been through a night like this. The good news is that you made it through. The bad news is it could happen again tonight. Or tomorrow. Or the next night. What if the incident is a new issue? What if you don’t know how to solve it? What if nobody knows how to solve it?

This three-part War Games series won’t help you solve your specific production incident. Instead, it will help you define a strategy that works for all production incidents when combined with domain knowledge of your system.

Being in an Incident

Being in an incident sucks.

Alerts are flying, executives are demanding status updates, customers are sending in support tickets, and in all this, you’re supposed to be making progress on the incident? How many incidents does it take to feel comfortable in this environment?

The key is never to get comfortable with an incident. Being comfortable with an active incident removes the urgency. Being comfortable means active incidents are ordinary. They should never feel normal. Instead, operators need to get comfortable with being uncomfortable.

Comfortable Operators

I’d go as far as saying that operators who are comfortable with incidents are a liability for the platform. When the only thing standing between your system being down in the middle of the night and responders actively rescuing it is an overly comfortable responder who thinks this is a “regular alert,” you’ve got a more significant problem in your operations organization.

There’s no such thing as a regular alert. There are only two categories:

  • Good, real alerts

  • Alerts that need tuning immediately

It’s so easy to get used to bad alerts.

“That fires once a week.”

“That usually clears in 10 minutes.”

“I’m not sure what it is, but it auto-closed.”

All of these are broken windows that blur the line between real and fake alerts. Once your organization loses trust that every alert is real, everyone is liable to skip an alert or two. This isn’t a failure of your team members; it’s a failure of your process.

Clean up these alerts: they’re tech debt. As tough as it sounds, whenever an alert fires (which should be as infrequently as possible), the on-call responder’s stomach should lurch. That means they trust it’s real.
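
One way to start that cleanup is to go through your alert history and list every alert that fired without anyone needing to do anything about it. Here’s a minimal Python sketch of that idea. The `AlertFiring` record, its `responder_acted` field, and the alert names are all hypothetical; you would fill them in from whatever your alerting tool exports.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertFiring:
    """One firing of an alert (hypothetical record, not any tool's real schema)."""
    name: str
    responder_acted: bool  # True if a human had to do something to resolve it


def alerts_needing_tuning(firings):
    """Count, per alert name, the firings that resolved with no human action."""
    no_action = Counter(f.name for f in firings if not f.responder_acted)
    return dict(no_action)


# Made-up history for illustration only.
history = [
    AlertFiring("db-primary-down", responder_acted=True),
    AlertFiring("disk-usage-80pct", responder_acted=False),
    AlertFiring("disk-usage-80pct", responder_acted=False),
    AlertFiring("queue-latency", responder_acted=False),
]

for name, count in sorted(alerts_needing_tuning(history).items()):
    print(f"{name}: fired {count} time(s) with no action taken")
```

Anything that shows up in this list is a candidate for the second category: tune it or delete it. The threshold here is deliberately strict (a single no-action firing is enough) to match the idea that there is no such thing as a regular alert.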

Calm Doesn’t Mean Comfortable

The key to production incidents is quelling the chaos and staying calm while feeling uncomfortable. Evaluating the incident, communicating clearly, and taking action can be done quickly and calmly, just like exiting a building during a fire. The urgency is still there, pushing you forward, keeping the pressure on, but it doesn’t cripple you: it drives you.

So how can you practice this? How can you make the team uncomfortable? How can you test the team’s response to the pressure?

In the following articles, we’ll dive into the solution: War Games.

If you’re interested in learning more about War Games or would like help running your first one, please send me a note at brian@connsulting.io or schedule a time to chat at https://calendly.com/connsulting.

