SaaS War Games - Part 2: War Game Basics

This article builds on SaaS War Games - Part 1, so I recommend reading that article before diving into this one.

In the first article of this series, we identified a few challenges of production incidents. They’re fast, high-pressure, and (hopefully) brand-new failures. If the best-case scenario is a new failure (remember: a repeated failure means we never solved it the first time), how can we practice? Won’t we be practicing for a scenario that we’ll never see again? The key is to practice the process, not the specific failure.

Practicing the Process

War Games are real incidents triggered by a game runner in a test environment. A team of War Gamers (between one and five responders) works to identify the issue, mitigate it, fix the root cause, clean up, and mark follow-up items for prevention (these five stages are a post of their own), all while communicating the issue internally and through a status page.

All this in two hours may sound stressful...and that’s the point. Production incidents are stressful. War Games expose the team to that stress and force them to work through it. I’ve run many War Games, and no team has made it through all five stages in the time allocated. That’s OK. Many teams don’t even make it through mitigation. The goal is to practice, practice, practice. It usually takes 5-6 War Games before responders get the basics down. I would much rather that be 5-6 mock incidents than real ones.

Roles

A lot of work is going on at the same time during a War Game. To stay organized, I teach a process based on the PagerDuty Incident Response Documentation. This documentation describes a variety of roles that are critical to managing effective incident response. Below are a few key roles:

  • Incident Commander: Coordinator of the incident, responsible for making decisions and moving the incident forward

  • Subject Matter Expert (SME): Expert on a specific part of the system, deep-diving into logs, metrics, and code at the request of the Incident Commander

  • Liaison: Manages high-level updates to internal stakeholders (executives) and external users through a status page (this combines the Internal and External Liaison roles in the PagerDuty documentation)

  • Scribe: Responsible for making sure all actions and follow-up items are documented (not responsible for taking all notes, but responsible for filling in gaps)

The PagerDuty documentation is a fabulous resource. Above I have highlighted the roles I find most critical. Often there will not be enough people to fill all the roles, so some responders need to wear multiple hats. That is perfectly fine as long as all roles are filled.
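As a rough illustration of "multiple hats", here is a minimal Python sketch that spreads the four roles above across however many responders are available and then checks that nothing is left uncovered. The function names, the round-robin approach, and the rule that extra responders join as additional SMEs are all assumptions made for this example, not part of the PagerDuty documentation.

```python
from collections import defaultdict

# The four roles highlighted above. The ordering is an assumption for this sketch.
ROLES = ["Incident Commander", "Subject Matter Expert", "Liaison", "Scribe"]


def assign_roles(responders, roles=ROLES):
    """Spread roles across responders round-robin so every role is filled."""
    if not responders:
        raise ValueError("at least one responder is required")
    assignments = defaultdict(list)
    for index, role in enumerate(roles):
        # With fewer responders than roles, the index wraps around and
        # some responders end up wearing multiple hats.
        assignments[responders[index % len(responders)]].append(role)
    # Assumption: anyone left without a role joins as an additional SME.
    for responder in responders[len(roles):]:
        assignments[responder].append("Subject Matter Expert")
    return dict(assignments)


def unfilled_roles(assignments, roles=ROLES):
    """Return the roles that nobody is covering."""
    covered = {role for held in assignments.values() for role in held}
    return [role for role in roles if role not in covered]


if __name__ == "__main__":
    team = assign_roles(["Ana", "Ben"])
    for responder, held in team.items():
        print(f"{responder}: {', '.join(held)}")
    print("Unfilled roles:", unfilled_roles(team))
```

With two responders, each one ends up holding two roles, and the check confirms that all four roles are covered before the game starts.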

Keeping Focused

One of the most significant challenges responders face is counterintuitive: they want to help too much. That sounds odd at first. How could helping be an issue? The subtlety lies in what they are helping with.

Read the documentation for each of the roles described above. Each one has an essential job to do. What happens when all the responders jump on a call and pile onto “fixing the issue”? It means that:

  • There is no communication internally or on a status page (the Liaison’s job)

  • No actions (rolling back to a previous code version, scaling out a service) are documented, which could lead to those accidentally being undone at a later time (the Scribe’s job)

  • No follow-up items (such as poor logging or missing credentials) are tracked to be fixed later (also the Scribe’s job)

  • No one is present to focus on mitigating the issue quickly before spending multiple hours trying to fix the root cause (the Incident Commander’s job)

Each role is critical to making the team function. It’s excellent when responders jump in to help, but every responder needs to understand their primary role. You don’t get bonus points for performing a secondary job if you’re not nailing your primary job. This is the most challenging concept to train responders on and is well worth multiple War Games to practice.

Successfully resolving a production incident requires the team to work as a team, not as a group of individuals each trying to resolve the incident alone. Relying on heroes to single-handedly solve the issue with no communication other than “found it” is hugely detrimental to your organization.

Rotating Roles

Lastly, these roles should not be static. Rotate responders through different roles. War Games are supposed to be uncomfortable, and sometimes you need to strongly encourage someone to try out a role. This is practice. We expect to make mistakes.

Even if a responder never plays that role during an actual incident, they’ll have a better appreciation for how that role operates. It only takes one rotation as a Scribe to understand just how important it is for others to document what they’re doing. One rotation as a Liaison makes you appreciate how vital regular summarization of the incident status from a SME is. Responders will be better teammates to each other once they see how difficult other roles are.
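One simple way to plan a rotation is to shift every role by one responder for each new War Game. The Python sketch below shows that idea. The function name and the shift-by-one scheme are assumptions for illustration; any schedule that moves people through unfamiliar roles works just as well.

```python
# The four roles from the Roles section. The ordering is an assumption for this sketch.
ROLES = ["Incident Commander", "Subject Matter Expert", "Liaison", "Scribe"]


def roles_for_game(responders, game_number, roles=ROLES):
    """Return {role: responder} for one War Game, shifting by one each game."""
    if not responders:
        raise ValueError("at least one responder is required")
    return {
        # Offsetting by the game number moves each role to the next responder.
        role: responders[(index + game_number) % len(responders)]
        for index, role in enumerate(roles)
    }


if __name__ == "__main__":
    team = ["Ana", "Ben", "Cy", "Dee"]
    for game in range(4):
        print(f"War Game {game + 1}: {roles_for_game(team, game)}")
```

With four responders, this schedule puts every responder in every role exactly once over four War Games.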

Conclusion

We’ve now got the War Game raw materials. We know the roles, we know the goals, and we’re ready to start our first War Game! In Part 3, we’ll work through planning, running, and retrospecting on a War Game.

 

If you’re interested in learning more about War Games or would like help running your first one, please send me a note at brian@connsulting.io or schedule a time to chat at https://calendly.com/connsulting.

