Summary

Good alerting is critical to operating a SaaS (or any other software) platform. Good alerts are timely, actionable, understandable, and correct. In this context, correct means minimizing false positives (alerting when there is not an issue) and false negatives (not firing when there is an issue).
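
As a rough illustration of what "correct" means here, the sketch below counts false positives and false negatives over a handful of alert evaluations. The `Evaluation` record, the `alert_correctness` helper, and the sample history are all hypothetical and invented for this example; they are not taken from any monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """One alert evaluation window (hypothetical record, for illustration only)."""
    alert_fired: bool
    real_issue: bool


def alert_correctness(evaluations):
    """Count false positives and false negatives across evaluation windows."""
    false_positives = sum(1 for e in evaluations if e.alert_fired and not e.real_issue)
    false_negatives = sum(1 for e in evaluations if not e.alert_fired and e.real_issue)
    return {"false_positives": false_positives, "false_negatives": false_negatives}


# Made-up history covering each of the four possible outcomes once.
history = [
    Evaluation(alert_fired=True, real_issue=True),    # correct alert
    Evaluation(alert_fired=True, real_issue=False),   # false positive: alert noise
    Evaluation(alert_fired=False, real_issue=True),   # false negative: missed issue
    Evaluation(alert_fired=False, real_issue=False),  # correctly quiet
]

print(alert_correctness(history))
```

Tracking both counts over time is one way to tell whether a threshold is too sensitive (many false positives) or too lax (many false negatives).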

Key Alerting Metrics are four metrics, based on customer pain, that monitor a whole system, subsystem, or microservice and that all production engineers can understand.

Description

Key Alerting Metrics treat each system as a black box and proxy what a customer is feeling. If a customer can’t notice an issue in the system, it most likely shouldn’t trigger an alert.

These metrics are:

Synchronous delay (HTTP response time) - A proxy for how responsive the UI (and the API, for programmatic access) feels to the user.
Synchronous errors (HTTP error percentage) - Errors often bubble up to the UI, leading to an unpleasant user experience.
Asynchronous delay (message bus lag or in-queue time) - Message buses are used to queue work, but that work should still be processed quickly. Increased queue length or long processing delays can lead to an unpleasant user experience (e.g. delayed notifications, long data processing times).
Asynchronous errors (message bus processing error percentage) - Errors in message processing can lead to dropped data, missing notifications, or incomplete results. Though these errors are less visible than synchronous errors, they still negatively affect the user experience.
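
The four metrics above can be sketched as a small calculation over raw samples. This is a minimal illustration, not a definitive implementation: the function names, the `Thresholds` values, the choice of the 95th percentile for response time, and the use of the worst in-queue time for asynchronous delay are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits only; real values depend on the product."""
    sync_delay_ms: float = 500.0
    sync_error_pct: float = 1.0
    async_delay_s: float = 60.0
    async_error_pct: float = 1.0


def percentile(values, pct):
    """Nearest-rank percentile; returns 0.0 when there are no samples."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_pct(errors, total):
    """Errors as a percentage of total events; 0.0 when nothing was processed."""
    return 100.0 * errors / total if total else 0.0


def key_alerting_metrics(response_ms, http_errors, http_total,
                         queue_delays_s, msg_errors, msg_total):
    """Compute the four metrics for one evaluation window."""
    return {
        "sync_delay_ms": percentile(response_ms, 95),     # HTTP response time
        "sync_error_pct": error_pct(http_errors, http_total),
        "async_delay_s": max(queue_delays_s, default=0.0),  # worst in-queue time
        "async_error_pct": error_pct(msg_errors, msg_total),
    }


def breached(metrics, limits):
    """Return the names of the metrics that exceed their limit."""
    return [name for name, value in metrics.items() if value > getattr(limits, name)]


# Made-up sample window: one slow request and a burst of processing errors.
metrics = key_alerting_metrics(
    response_ms=[120, 150, 180, 200, 220, 250, 300, 350, 400, 900],
    http_errors=2, http_total=1000,
    queue_delays_s=[5, 12, 30],
    msg_errors=30, msg_total=1000,
)
print(breached(metrics, Thresholds()))
```

Because every service reports the same four names, the same check can be applied to a whole system, a subsystem, or a single microservice; only the thresholds change.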

Alerting vs Debugging Metrics

The metrics above minimize false positives and negatives (correct), are consistent across different parts of the platform (understandable), and reflect the current user experience of the product (timely). On their own, though, they may not pinpoint the cause of a production issue (actionable), because Key Alerting Metrics track symptoms, not root causes. They must be paired with debugging metrics (highly specific service metrics, often understood only by Subject Matter Experts) to understand the details of an incident. Debugging metrics are usually too specific to build into alerts but are useful once an investigation is under way.

By alerting on well-understood, customer-facing symptoms and debugging with detailed, service-specific metrics, organizations can respond quickly and effectively to production incidents with minimal alert noise.

Related Content

Sep 28, 2022

Key Alerting Metrics

Aug 29, 2022

System Impact and Mitigation

The goal of incident response is to minimize the total impact on customers over time through mitigation and root cause resolution. For example, a high-impact, short-duration incident (five-minute total outage) can be as impactful to a customer as a low-impact, long-duration incident (slowness for a full day).

A key component of SaaS incident response is to mitigate the incident, if possible, to lessen the immediate impact on the customer and buy the team time to resolve the issue permanently.

Aug 16, 2022

Product Delivery Team

For SaaS companies, the Product Delivery Team is all individuals involved in building and operating the product. All these sub-teams share a common goal: continuously deliver customer value.
