Key Alerting Metrics
Summary
Good alerting is critical to operating a SaaS (or any other software) platform. Good alerts are timely, actionable, understandable, and correct. In this context, correct means minimizing false positives (alerting when there is not an issue) and false negatives (not firing when there is an issue).
Key Alerting Metrics are four metrics for monitoring a whole system, subsystem, or microservice. They are based on customer pain, and all production engineers can understand them.
Description
Key Alerting Metrics treat each system as a black box and proxy what a customer is feeling. If a customer can’t notice an issue in the system, it most likely shouldn’t trigger an alert.
These metrics are:
Synchronous delay (HTTP response time) - A proxy for how responsive the UI (and API for programmatic access) is for the user
Synchronous errors (HTTP error percentage) - Errors often bubble up to the UI, leading to an unpleasant user experience
Asynchronous delay (message bus lag or time in queue) - Message buses are used to queue work, but that work should be processed quickly. Increased queue length or long delays in processing can lead to an unpleasant user experience (e.g. delayed notifications, long data processing times).
Asynchronous errors (message bus processing error percentage) - Errors in message processing can lead to dropped data, missing notifications, or incomplete results. Though these errors are not as visible as synchronous errors, they still negatively affect user experience.
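As a rough illustration, the four metrics above could be checked against thresholds as in the sketch below. This is a minimal example, not a reference implementation: the names KeyAlertingMetrics, error_percentage, and firing_alerts, the units, and every threshold value are assumptions made for illustration. In practice these values would come from whatever monitoring system is in use, and thresholds would be tuned per system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyAlertingMetrics:
    """One value per Key Alerting Metric (units are assumptions)."""
    sync_delay_ms: float    # HTTP response time, in milliseconds
    sync_error_pct: float   # HTTP error percentage
    async_delay_s: float    # message bus lag or time in queue, in seconds
    async_error_pct: float  # message bus processing error percentage


def error_percentage(errors: int, total: int) -> float:
    """Return errors as a percentage of total; 0.0 when there was no traffic."""
    if total == 0:
        return 0.0
    return 100.0 * errors / total


def firing_alerts(observed: KeyAlertingMetrics,
                  thresholds: KeyAlertingMetrics) -> List[str]:
    """Return the names of the metrics whose observed value exceeds its threshold."""
    checks = [
        ("synchronous delay", observed.sync_delay_ms, thresholds.sync_delay_ms),
        ("synchronous errors", observed.sync_error_pct, thresholds.sync_error_pct),
        ("asynchronous delay", observed.async_delay_s, thresholds.async_delay_s),
        ("asynchronous errors", observed.async_error_pct, thresholds.async_error_pct),
    ]
    return [name for name, value, limit in checks if value > limit]


# Hypothetical thresholds and observations, chosen only for illustration.
thresholds = KeyAlertingMetrics(
    sync_delay_ms=500.0, sync_error_pct=1.0, async_delay_s=60.0, async_error_pct=1.0
)
observed = KeyAlertingMetrics(
    sync_delay_ms=320.0,
    sync_error_pct=error_percentage(errors=12, total=400),
    async_delay_s=95.0,
    async_error_pct=error_percentage(errors=0, total=50),
)

print(firing_alerts(observed, thresholds))
```

Because every system is reduced to the same four values, the same check can be reused for a whole platform, a subsystem, or a single microservice; only the thresholds change.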
Alerting vs Debugging Metrics
The metrics above minimize false positives and false negatives (correct), are consistent across different parts of the platform (understandable), and reflect the current user experience of the product (timely). Still, on their own they may not pinpoint the cause of a production issue (actionable), because Key Alerting Metrics track symptoms, not root causes. They must be paired with debugging metrics (highly specific service metrics, often understood only by Subject Matter Experts) to understand the details of an incident. Debugging metrics are often too specific to build into alerts but are useful during debugging.
By alerting on well-understood, customer-facing symptoms and debugging with detailed, service-specific metrics, organizations can respond quickly and effectively to production incidents with minimal alert noise.