ops, dev Brian Conn

Key Alerting Metrics

Good alerting is critical to operating a SaaS (or any other software) platform. Good alerts are timely, actionable, understandable, and correct. In this context, correct means minimizing false positives (alerting when there is not an issue) and false negatives (not firing when there is an issue).
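One way to put numbers on "correct" is to count alert outcomes over a review period and compute precision (how often a fired alert was a real issue) and recall (how often a real issue produced an alert). The sketch below is only an illustration: the `AlertStats` type, its field names, and the sample counts are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Hypothetical tally of alert outcomes over a review period."""
    true_positives: int   # alert fired and there was an issue
    false_positives: int  # alert fired but there was no issue
    false_negatives: int  # there was an issue but no alert fired


def precision(stats: AlertStats) -> float:
    """Fraction of fired alerts that pointed at a real issue."""
    fired = stats.true_positives + stats.false_positives
    return stats.true_positives / fired if fired else 1.0


def recall(stats: AlertStats) -> float:
    """Fraction of real issues that produced an alert."""
    issues = stats.true_positives + stats.false_negatives
    return stats.true_positives / issues if issues else 1.0


# Made-up counts: 8 useful alerts, 2 false alarms, 2 missed issues.
stats = AlertStats(true_positives=8, false_positives=2, false_negatives=2)
print(f"precision={precision(stats):.2f} recall={recall(stats):.2f}")
```

Fewer false positives raise precision and fewer false negatives raise recall, so tracking both shows which kind of error an alert is making.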

Key Alerting Metrics are four metrics for monitoring a whole system, subsystem, or microservice. They are based on customer pain, and every production engineer can understand them.

ops Brian Conn

System Impact and Mitigation

The goal of incident response is to minimize the total impact on customers over time through mitigation and root cause resolution. For example, a high-impact, short-duration incident (five-minute total outage) can be as impactful to a customer as a low-impact, long-duration incident (slowness for a full day).

A key component of SaaS incident response is to mitigate the incident, if possible, to lessen the immediate impact on the customer and buy the team time to resolve the issue permanently.
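The trade-off between impact and duration can be sketched as a simple sum of severity multiplied by time. This is a rough model under stated assumptions, not the article's own formula: the `total_impact` helper, the 0-to-1 severity scale, and every number below are hypothetical.

```python
def total_impact(segments):
    """Sum severity * minutes over (severity, minutes) segments.

    Severity is an assumed 0-to-1 scale where 1.0 means a total outage.
    """
    return sum(severity * minutes for severity, minutes in segments)


# Hypothetical incidents, measured in "outage-minutes".
outage = [(1.0, 5)]              # five-minute total outage
slowness = [(0.005, 24 * 60)]    # slight slowness for a full day

# Hypothetical mitigation: after 5 minutes of outage, a workaround
# cuts severity to 0.1 while the permanent fix takes another hour.
unmitigated = [(1.0, 65)]
mitigated = [(1.0, 5), (0.1, 60)]

for name, incident in [("outage", outage), ("slowness", slowness),
                       ("unmitigated", unmitigated), ("mitigated", mitigated)]:
    print(f"{name}: {total_impact(incident):.1f} outage-minutes")
```

Under these made-up numbers the short outage and the day of slowness land in the same range, and mitigating early reduces the total far more than the final fix alone would.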
