SaaS War Games - Part 3: Running a War Game
Over the last two articles (SaaS War Games - Part 1 and SaaS War Games - Part 2), we dove into the value of War Games. In this article, the rubber meets the road on how to run one.
Planning is Key
Planning is critical to running a successful War Game. One of the core goals is for the incident to feel real, so expect to spend 3-4x the amount of time planning the War Game as you spend running it.
Prepare the Team
First, you need to prepare the team for the event. You should:
Block out time on the team’s calendar as far in advance as possible (90 minutes is the absolute minimum for a War Game, 2 hours is better)
Schedule the War Game at a good time for everyone (time zones are always fun to manage)
Not too early, not too late (unless you’re building a more advanced game where you deliberately want people who aren’t fully awake)
Not too close to key deliverables or the end of a sprint
Not on meeting-heavy days
Advise people to eat a meal before the game or bring a snack and drink (you don’t need even more stress during the game)
Tell gamers that they need to be 100% focused during the game
Turn off email and mute Slack
Request volunteers for roles (and push folks towards new roles if they always pick the same ones) in a public forum like Slack well in advance of the game
For a game to be effective, the team has to be fully present. Make sure you communicate this to engineers and their managers. War Games are mentally draining. Expect this to be the event for the day.
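Finding a time that is "not too early, not too late" across time zones can be checked mechanically. The sketch below is only an illustration: the team locations, the fixed UTC offsets (which ignore daylight saving time), and the 9:00 to 17:00 local window are all assumptions, not part of any real scheduling tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical team locations mapped to fixed UTC offsets in hours.
# Assumption: fixed offsets ignore daylight saving time.
TEAM_UTC_OFFSETS = {"Boston": -5, "London": 0, "Berlin": 1}


def fits_local_window(start_utc, duration, utc_offset_hours, earliest_hour, latest_hour):
    """Return True if the whole game falls inside one location's local window."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local_start = start_utc.astimezone(tz)
    local_end = local_start + duration
    window_open = local_start.replace(hour=earliest_hour, minute=0, second=0, microsecond=0)
    window_close = local_start.replace(hour=latest_hour, minute=0, second=0, microsecond=0)
    return window_open <= local_start and local_end <= window_close


def candidate_start_hours_utc(offsets, duration_minutes=120, earliest_hour=9, latest_hour=17):
    """List the UTC start hours at which the game fits everyone's local window."""
    duration = timedelta(minutes=duration_minutes)
    day = datetime(2024, 1, 10, tzinfo=timezone.utc)  # arbitrary example date
    return [
        hour
        for hour in range(24)
        if all(
            fits_local_window(day + timedelta(hours=hour), duration, offset,
                              earliest_hour, latest_hour)
            for offset in offsets.values()
        )
    ]


print(candidate_start_hours_utc(TEAM_UTC_OFFSETS))  # prints [14]
```

With these assumed offsets, a two-hour game only fits at 14:00 UTC, which shows how quickly the options narrow and why booking far in advance matters.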
Prepare the Incident
Next, you need to construct the incident. This is more of an art than a science and requires you to understand your system deeply. Below are a few things to consider when constructing an incident:
Great incidents have one root cause yielding multiple symptoms, ideally in very different parts of the system
E.g. introduce network latency between a downstream app and a database causing upstream services to slow down
Partial failures usually make for more exciting games than total failures
Partial failures really test the team’s knowledge of how distributed systems fail and are generally harder to pinpoint
Also, partial failures are usually mitigatable, which encourages mitigation before fixing the root cause
Good incidents are hard to figure out but obvious afterward
Simplicity is key
You may want to tie the incident to a new feature or recent production issue
Use the War Game to highlight new potential failure modes with the system so the team can patch them up immediately
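The network latency example above can be prototyped before touching real infrastructure. The following is a minimal sketch, assuming a toy call chain: `query_database`, `order_service`, and `checkout_page` are hypothetical stand-ins for a database, a downstream app, and an upstream service. Setting `fraction` below 1.0 models a partial failure where only some calls are slow.

```python
import random
import time


def with_injected_latency(func, delay_seconds, fraction=1.0, rng=None, sleep=time.sleep):
    """Wrap func so that roughly `fraction` of calls are delayed before running."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # One root cause: a delay in front of the dependency.
        if rng.random() < fraction:
            sleep(delay_seconds)
        return func(*args, **kwargs)

    return wrapper


# Hypothetical stand-in for the database.
def query_database(order_id):
    return {"order_id": order_id, "status": "shipped"}


# Hypothetical downstream app that talks to the database.
def order_service(order_id, db=query_database):
    return db(order_id)


# Hypothetical upstream service; it inherits any slowness from below.
def checkout_page(order_id, db=query_database):
    return f"Order {order_id}: {order_service(order_id, db)['status']}"


if __name__ == "__main__":
    # Delay about 30% of database calls by 50 ms; the seed makes runs repeatable.
    slow_db = with_injected_latency(query_database, 0.05, fraction=0.3,
                                    rng=random.Random(42))
    for order_id in range(5):
        started = time.monotonic()
        page = checkout_page(order_id, db=slow_db)
        print(page, f"({time.monotonic() - started:.3f}s)")
```

The upstream page still returns the right answer, only slower and only sometimes, which is the kind of symptom that is hard to pinpoint but obvious afterward. In a real game the delay would be injected at the network or infrastructure layer rather than in application code.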
Work backward when creating an incident. Creating a good incident is like writing a good mystery novel: start with the twist, then work backward to make sure the right clues show up.
Once you create your incident: test it. Make sure the symptoms show up, and within a reasonable amount of time. Make sure the alerts fire, the message queues back up, or the UI is slow.
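A dry run of the incident can be timed with a small polling helper. This is a sketch under assumptions: `wait_for_symptom` is a hypothetical helper, and the `check` callable stands in for whatever you would really query, such as whether an alert is firing or a queue depth has crossed a threshold.

```python
import time


def wait_for_symptom(check, timeout_seconds, poll_seconds=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True.

    Returns the seconds elapsed when the symptom appeared,
    or None if it did not appear before the timeout.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_seconds:
            return None
        sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical symptom: a queue that grows by 40 messages on each poll.
    queue_depth = {"value": 0}

    def queue_is_backed_up():
        queue_depth["value"] += 40
        return queue_depth["value"] > 100

    elapsed = wait_for_symptom(queue_is_backed_up, timeout_seconds=5, poll_seconds=0.1)
    if elapsed is None:
        print("Symptom never appeared; rework the incident")
    else:
        print(f"Symptom appeared after {elapsed:.1f}s")
```

If a symptom takes longer to appear than the time you have budgeted for the game, that is a sign to rework the incident before the team ever sees it.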
Running a War Game
Just Shut Up
I’ll admit, to run a War Game you have to be a bit sadistic. Just know it’s for a good cause. The most important thing for you to do as a game runner is to shut up. Let the team struggle. Let the uncomfortable silences linger. After your introduction to the War Game, just don’t talk for 30 minutes. Those 30 minutes are unbelievably crucial because the team will choose to either sink or swim. This is the number one lesson of the War Game: this is your system to fix; you need to find a path forward.
It’s so easy to jump in and give a hint. Or say the team is going in the wrong direction. Or just talk to fill the silence. Don’t. If your first War Game is 90 uncomfortable minutes of silence, then you should consider it a success. The first step to getting better is realizing you have a problem. It looks like the team needs a lot more War Games!
Take note of who steps up during War Games. Ideally, it’s the Incident Commander, but take note of who chooses to swim. These are vital resources for you to develop. Those who sink may need a confidence boost or more training. Both are OK; they just require different types of training.
Providing Help
One of my favorite questions to ask during a War Game is “Should we open a status page?” especially when there isn’t a clear answer. This forces gamers to really think about how a user would experience the current incident. The challenge is that by the time you’re sure you need a status page, it’s probably too late. Do we need a status page if messages are delayed by 10 seconds? A minute? 10 minutes? It depends on the message. War Games are a way for the team to explore these difficult decisions.
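One way to make "it depends on the message" concrete during the discussion is to write the tolerances down. The sketch below is purely illustrative: the message types, the thresholds, and the `should_open_status_page` helper are all assumptions, and the real values are exactly what the team should debate.

```python
# Hypothetical tolerated delay per message type, in seconds.
TOLERATED_DELAY_SECONDS = {
    "password_reset": 60,      # users are actively waiting on this one
    "marketing_digest": 3600,  # nobody notices if this arrives an hour late
}

# Assumed fallback for message types nobody has classified yet.
DEFAULT_TOLERATED_DELAY_SECONDS = 300


def should_open_status_page(message_type, delay_seconds,
                            tolerances=TOLERATED_DELAY_SECONDS,
                            default=DEFAULT_TOLERATED_DELAY_SECONDS):
    """Return True when the delay exceeds what users of this message type tolerate."""
    return delay_seconds > tolerances.get(message_type, default)


# The same 10-minute delay leads to different answers per message type.
print(should_open_status_page("password_reset", 600))    # prints True
print(should_open_status_page("marketing_digest", 600))  # prints False
```

A table like this will never settle the question on its own, but arguing over the numbers forces the team to think from the user's side.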
After the 30-minute mark, you can give a hint here and there, but only answer questions with other questions. Additionally, the team may not contain SMEs from all disciplines, so you can step in and answer questions as if you were a member of that other team (this emulates escalation to another team). Don’t be omniscient, though. Only answer the question as someone from that team would.
Retrospecting
Reserve the last 30 minutes (this is why a two-hour block is best) to retrospect. First, give the team 5 minutes to cool off after the incident and walk around. War Games can be intense, so give people a minute to reset.
Go around the group and ask each individual to reflect. How do they think they and the team did? Sometimes people will be overly critical of themselves, so make sure to encourage them and remind the team that there was no expectation of solving the issue entirely. Ask them what they thought of their role and how others could help them.
Finally, provide constructive feedback to the team. War Games are learning experiences. Did the team jump right to root cause? Were no actions written down? Did a status page go up at the wrong time or not at all? Gear this feedback towards the experience level of the team and raise the bar every War Game. Provide the right balance of encouragement and critical feedback. Even with critical feedback, all teams I’ve worked with have been excited to set up the next War Game and do even better during it.
Conclusion
Hopefully, now you’re excited and prepared to run your first War Game! They are a fun way to prepare your team for SaaS production operations, and they’re also an extremely valuable tool for de-risking operations and identifying leaders within your organization. Please reach out to me at brian@connsulting.io to let me know how your first War Game goes!
If you’re interested in learning more about War Games or would like help running your first one, please send me a note at brian@connsulting.io or schedule a time to chat at https://calendly.com/connsulting.