
The 5 Stages of a Production Incident

Here’s a bit of a paradox: the better you are at solving SaaS production incidents, the harder each incident is to solve.

At first glance, this doesn’t make a lot of sense. Wouldn’t being better make solving production incidents easier? No. The trick is that once you get good at production incidents, you don’t get hit with the easy ones anymore: you solve them for good. That leaves only the new and challenging problems for you to solve. The average incident is more complex, but your reward is that the frequency of incidents goes way down.

I’d take that trade any day.

Finding the Similarities

I mentioned back in my War Games series that practice is vital when resolving incidents. If you don’t get hit with the easy incidents, each incident is brand new. That’s moderately terrifying. How can we fix a brand new incident every time?

If we treat every incident as brand new, we’ll have a tough time resolving incidents in a quick, efficient, and (most importantly) reliable manner. We need to find the similarities between incidents and develop a process that works every time. That way, we can be confident our known process will help us resolve a new, unknown issue.

Below is a process I developed that has worked exceptionally well for me over the past five years.

The Five Stages

Here are the five stages every team should go through during an incident:

  • System Impact - Identify the impacted subsystems or services, determine how customers are experiencing the symptoms, and decide whether a status page is required

  • Mitigation - Mitigate the issue temporarily to buy the team time, minimize customer impact, or reduce data corruption

  • Root Cause - Find and resolve the root cause

  • Restore - Fix corrupted data, notify customers of dropped data or missed notifications, scale down service clusters, etc.

  • Prevention - Prevent the issue from happening again by creating (and prioritizing) better instrumentation, alerting, bug fixes, automation, etc.

Easy, right? Yes, on paper, but it can be challenging to identify and respond to these stages in practice. Let’s dive into each one.
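
Before we do, here’s a minimal sketch of how a team might track which stage an incident is in. It’s purely illustrative; the names are hypothetical, not a tool the process requires:

```python
# Purely illustrative sketch: a tiny record of which stage an incident is in.
# The class and field names are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Stage(Enum):
    SYSTEM_IMPACT = "system impact"
    MITIGATION = "mitigation"
    ROOT_CAUSE = "root cause"
    RESTORE = "restore"
    PREVENTION = "prevention"


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.SYSTEM_IMPACT
    history: list = field(default_factory=list)

    def move_to(self, stage: Stage) -> None:
        # Record each transition so the timeline is easy to reconstruct
        # in the post-mortem later.
        self.history.append((datetime.now(), stage))
        self.stage = stage


incident = Incident(title="Delayed notifications")
incident.move_to(Stage.MITIGATION)  # announce the stage change to the team, too
```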

System Impact

Identifying the affected parts of the system during an incident seems obvious, but you’d be surprised how quickly operators skip over this step. Taking a moment (could be 30 seconds, could be 10 minutes) to think about the customer impact of the incident is extremely important. You wouldn’t restart the DB if a customer couldn’t even tell you were in an incident, right? Without taking a moment to evaluate the system and think about customer-impacting symptoms, it’s entirely possible to take drastic action when it’s not yet necessary.

One of the most critical points I teach during on-call training is that the risk associated with an operations action needs to match or be below the customer-facing impact of the incident. If customers have slightly delayed notifications or the UI is slow, we should only be taking low-risk actions. However, if the system is fully down, then we may need to take more drastic action.
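
To make that rule concrete, here’s a rough sketch of the risk-to-impact check. The impact levels, risk labels, and function are hypothetical and purely for illustration; your categories will differ:

```python
# Illustrative only: map customer-facing impact to the maximum risk level of
# operations actions the team should consider. Levels are hypothetical.
IMPACT_TO_MAX_RISK = {
    "minor degradation": "low",   # e.g. slightly delayed notifications, slow UI
    "partial outage": "medium",   # e.g. one feature or region unavailable
    "full outage": "high",        # e.g. restarting the DB, failing over a region
}

RISK_ORDER = ["low", "medium", "high"]


def action_allowed(action_risk: str, customer_impact: str) -> bool:
    """True if the action's risk is at or below what the current impact justifies."""
    return RISK_ORDER.index(action_risk) <= RISK_ORDER.index(IMPACT_TO_MAX_RISK[customer_impact])


print(action_allowed("high", "minor degradation"))  # False: don't restart the DB yet
print(action_allowed("low", "full outage"))         # True: low-risk actions are always fair game
```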

Identifying system impact is not a one-time thing. During an incident, the team should constantly (every 15 minutes or less) reevaluate the system impact so they can adequately assess risk. Additionally, by understanding the customer impact of the incident, the team should know whether a status page is needed.

Mitigation

Mitigation is the most underused tool in an operator’s toolbox. Why mitigate when we can just fix the issue? The answer is timing.

Mitigation buys the team time. By scaling out services, temporarily increasing VM sizes, turning off cleanup processes, throttling incoming requests, rolling back code, or any number of other mitigation strategies, operators can temporarily stabilize (or reduce the issue impact on) the system. Taking the time pressure off the incident through temporary mitigation allows the team to:

  • Bring additional subject matter experts online

  • Further investigate the issue using low-risk operations

  • Test potential fixes in lower environments

All of the items above are risk reduction tactics. During an incident, the last thing you want to do is accidentally make it worse through a knee-jerk, high-risk action.

Buy yourself time to try lower-risk, lower-impact actions instead of trying to be a hero.

If you don’t think there are any mitigation strategies available for your platform, then now is the best time to build some in.
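
As a concrete example, a mitigation lever can be as simple as a flag that sheds non-critical traffic or pauses background work. Here’s a rough sketch; the flag names are hypothetical, and in practice they would live in your feature-flag or configuration system rather than in a module-level dict:

```python
# Illustrative sketch of one mitigation lever: shed a configurable fraction of
# non-critical requests while the incident is ongoing. Flag names are hypothetical.
import random

MITIGATION_FLAGS = {
    "shed_noncritical_requests": 0.0,  # fraction of non-critical requests to reject (0.0-1.0)
    "pause_cleanup_jobs": False,
}


def should_accept(request_is_critical: bool) -> bool:
    """Drop a fraction of non-critical requests while the shedding flag is raised."""
    if request_is_critical:
        return True
    return random.random() >= MITIGATION_FLAGS["shed_noncritical_requests"]


# During an incident, an operator might flip the levers like this:
MITIGATION_FLAGS["shed_noncritical_requests"] = 0.5  # shed half the optional traffic
MITIGATION_FLAGS["pause_cleanup_jobs"] = True        # buy headroom for the investigation
```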

Root Cause

Now comes the actual fixing. Hopefully, you have a status page open and have mitigated the issue to buy yourself a bit of time. Get the right experts on the line, find the right metrics, test out fixes in lower environments, and debug the issue.

The only thing I’ll add to this traditional debugging step is to take note of the things you’re missing during the incident. Complain as an operator to your developer self. Are metrics missing? Are logs empty or confusing? Are no mitigation strategies available? Is different code running in prod than in a lower environment? Make notes of these complaints; we’ll collect them during the Prevention phase. There is no better time to understand the operator’s pain than when you’re in an incident.

Restore

Congratulations, you’ve successfully solved the incident. It’s tempting to pat yourself on the back now that it’s over and try to forget it ever happened, but we’re not done yet. If you’ve played any swinging sport, you’ll know how important it is to follow through. We’ve still got work to do.

Depending on the type of incident, you may have a lot of restoration to do or none at all. The restore phase is where we get the system back to a stable, uncorrupted state. Scale clusters and instances back down, clean up database corruption, unthrottle requests, and notify customers of any lasting impact of the incident. You may not be able to restore everything without going back into an incident, so some restoration might need to wait until after the Prevention phase. Do as much as you can as quickly as possible.

Finally, write a post-mortem. First, write an internal post-mortem, then create a summary to publish as a public post-mortem on your status page. Writing a post-mortem is a crucial part of incident restoration. It forces the team to truly understand what happened, and you need that understanding to effectively prevent the same incident from happening again.

Prevention

Remember all the complaints we made to ourselves during root cause analysis? This stage is where we make tickets for all of them and actually prioritize them. These are bugs. Prioritize them accordingly. Not only do we need to fix the true root cause of the issue if it still exists (you should know from writing the post-mortem), but we also need to make the system more operable in the future.

Getting hit by an incident happens to everyone. Getting hit by the same incident a second time is less excusable because it’s preventable. Spend the time improving your system now instead of wasting time (and causing customer frustration) dealing with the same incident a second time.

Summary

Putting these five stages into practice is difficult. The lines between them are thin, and you often need to jump back a step or two throughout an incident. The key, as always, is practice. Understanding what’s expected in each stage and communicating with your team about which stage you are in will help everyone solve the correct problems at the right time. It’s so easy to jump straight to root cause analysis, but it’s far better to spend two hours solving an incident you mitigated in the first five minutes than one hour solving it during a full outage.

War Games are the best way I’ve found to get the team speaking the same language during incidents. Please reach out if you want help running your own!

If you’re interested in training on handling SaaS production incidents, send me a note at brian@connsulting.io or schedule a time to chat at https://calendly.com/connsulting.

