Connsulting

System Impact and Mitigation

Summary

The goal of incident response is to minimize the total impact on customers over time through mitigation and root cause resolution. For example, a high-impact, short-duration incident (five-minute total outage) can be as impactful to a customer as a low-impact, long-duration incident (slowness for a full day).

A key component of SaaS incident response is to mitigate the incident, if possible, to lessen the immediate impact on the customer and buy the team time to resolve the issue permanently.

Description

SaaS incident response can be broken into five stages (see The 5 Stages of a Production Incident for more details):

  • Identifying the System Impact

  • Mitigating the issue

  • Resolving the root cause

  • Restoring the system to a healthy state

  • Preventing the issue from occurring again

Reducing the System Impact

The System Impact of the incident is broken into two components:

  • The severity of the incident (Inconvenience, Degradation, Partial Outage, and Major Outage)

  • The number of users impacted (single user, multiple users within a tenant, multiple tenants)

Both components measure the instantaneous impact on users (the pain they experience at a given moment). Summing this impact over the duration of the incident gives the total incident impact on users. By this measure, reducing the instantaneous system impact to a lower value is almost as valuable as resolving the incident entirely. Conversely, jumping straight to root cause analysis can, in some cases, produce a higher total incident impact even if the incident is resolved sooner.

See the effects of mitigation on the total incident impact on users in the graphs below.

Instantaneous incident impact on users with and without mitigation

Total incident impact on users with and without mitigation
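
The relationship between instantaneous and total impact can be sketched with a little arithmetic. The Python example below is only an illustration: the numeric severity weights, the choice to multiply severity by the number of users, and the incident timelines are all assumptions made for the example, not values from this article.

```python
from dataclasses import dataclass

# Hypothetical weights: the article names the severity levels
# but does not assign numbers to them.
SEVERITY_WEIGHT = {
    "Inconvenience": 1,
    "Degradation": 2,
    "Partial Outage": 4,
    "Major Outage": 8,
}


@dataclass(frozen=True)
class Phase:
    """A stretch of the incident during which the impact stays constant."""
    severity: str
    users_impacted: int
    minutes: float


def instantaneous_impact(phase):
    # Assumption: impact at a given moment = severity weight x users impacted.
    return SEVERITY_WEIGHT[phase.severity] * phase.users_impacted


def total_impact(phases):
    # Total impact = instantaneous impact summed over time.
    return sum(instantaneous_impact(p) * p.minutes for p in phases)


# Scenario A (assumed): go straight to root cause, fixed after 60 minutes.
without_mitigation = [Phase("Partial Outage", 100, 60)]

# Scenario B (assumed): 10 minutes to mitigate, then 80 minutes of
# degraded service while the root cause is resolved.
with_mitigation = [
    Phase("Partial Outage", 100, 10),
    Phase("Degradation", 100, 80),
]

print(total_impact(without_mitigation))  # 4 * 100 * 60 = 24000
print(total_impact(with_mitigation))     # 4000 + 16000 = 20000
```

In this made-up scenario the mitigated incident lasts 90 minutes instead of 60, yet its total impact is lower because most of that time is spent at a lower severity.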

Mitigation Tactics

Examples of mitigation tactics are:

  • Scaling out services horizontally (useful for large message bus backlogs)

  • Scaling up services vertically (increasing the size of instances if the application can’t scale well horizontally)

  • Throttling particular tenants or operations (should be used only in coordination with the support team)

  • Increasing DB or disk IO (to support more concurrent operations within platform bottlenecks)

  • Temporarily disabling features using feature toggles

If your application doesn’t support these operations, consider investing in them. Providing operators with tools to mitigate incidents will greatly improve your team's response time and your platform's resilience.
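
As a rough idea of what such operator tooling might look like, here is a minimal Python sketch of two of the tactics above: a feature toggle that can be switched off during an incident, and a per-tenant throttle. The class names, the in-memory storage, and the token-bucket approach are assumptions chosen for illustration; a real platform would likely keep this state in shared configuration so that every instance sees the change.

```python
import time


class FeatureToggles:
    """Hypothetical in-memory toggle store."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        # Unknown features are treated as disabled.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled


class TenantThrottle:
    """Token-bucket limiter applied only to tenants an operator has throttled."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._limits = {}  # tenant_id -> (rate_per_sec, burst)
        self._state = {}   # tenant_id -> (tokens, last_timestamp)

    def throttle(self, tenant_id, rate_per_sec, burst):
        self._limits[tenant_id] = (rate_per_sec, burst)
        self._state[tenant_id] = (float(burst), self._clock())

    def release(self, tenant_id):
        self._limits.pop(tenant_id, None)
        self._state.pop(tenant_id, None)

    def allow(self, tenant_id):
        if tenant_id not in self._limits:
            return True  # tenant is not throttled
        rate, burst = self._limits[tenant_id]
        tokens, last = self._state[tenant_id]
        now = self._clock()
        # Refill tokens for the elapsed time, capped at the burst size.
        tokens = min(burst, tokens + (now - last) * rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self._state[tenant_id] = (tokens, now)
        return allowed


# Example operator actions during an incident (names are hypothetical).
toggles = FeatureToggles({"bulk_export": True})
toggles.set("bulk_export", False)  # temporarily disable an expensive feature

throttle = TenantThrottle()
throttle.throttle("tenant-42", rate_per_sec=1, burst=2)  # limit one noisy tenant
```

Request handlers would check `toggles.is_enabled(...)` and `throttle.allow(tenant_id)` before doing the work, and the operator would call `release` and re-enable the toggle once the incident is resolved.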

Summary

Always attempt to mitigate unknown issues before jumping to root cause analysis during an incident. You can perform these steps in parallel if multiple engineers respond to the incident. Agree with your team in advance on how long to spend attempting mitigation before moving on to root cause analysis. This time budget will vary with the incident severity.

If you and your team are unsure, try running a War Game to practice.
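
One way to make that agreement concrete is to write the mitigation time budget down where responders can find it. The Python sketch below is an assumption-laden illustration: the budget values, their ordering across severities, and the `next_step` helper are placeholders for whatever your team decides, not recommendations from this article.

```python
from datetime import timedelta

# Placeholder budgets for illustration only; agree on real values with your team.
MITIGATION_BUDGET = {
    "Inconvenience": timedelta(minutes=10),
    "Degradation": timedelta(minutes=20),
    "Partial Outage": timedelta(minutes=30),
    "Major Outage": timedelta(minutes=45),
}


def next_step(severity, elapsed, mitigated):
    """Return what a single responder should focus on next."""
    if mitigated or elapsed >= MITIGATION_BUDGET[severity]:
        return "root cause analysis"
    return "mitigation"


print(next_step("Degradation", timedelta(minutes=5), mitigated=False))   # mitigation
print(next_step("Degradation", timedelta(minutes=25), mitigated=False))  # root cause analysis
```

With several responders, the same table can simply indicate when to pull additional people from mitigation onto root cause analysis.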

