Alerting That Doesn't Burn Out Your Team

Series: The Modern SDLC · Post 14 of 17 ← Post 13: Observability · Post 15: Incident Management →
Being on-call at a company with bad alerting is one of the more demoralising experiences in engineering. Your phone wakes you at 2am. You investigate. Nothing is wrong. You go back to sleep. It wakes you again at 3am. Different alert, same outcome. By the time a real problem fires at 5am, you've already been awake twice for no reason, your judgment is impaired, and you've been trained by experience to treat every alert as probably noise.
This is alert fatigue, and it's a safety issue as much as a quality-of-life issue. Teams with alert fatigue don't just have unhappy engineers; they have slower incident response because the signal has been buried under noise. The alert that matters gets the same treatment as the forty that didn't.
The goal of a good alerting system is the opposite of what most teams build. Not more alerts. Not comprehensive coverage. Not a page for every anomaly. Fewer alerts. Higher signal. Pages only when users are actually affected. Everything else in a Slack channel or a ticket, not a pager.
The one thing to remember
Alert on symptoms, not causes. Users don't care that CPU is at 90%. They care that the checkout is slow. Alert on what users experience, then investigate causes using the observability tools from Post 13.
Symptoms vs causes: the reframe that changes everything
Most teams build cause-based alerts first because they're intuitive. CPU high, disk full, pod restart detected, memory above 80% — these feel like the right things to watch because they represent internal resource states that can lead to problems.
The trouble is that internal resource states and user experience have an inconsistent relationship. CPU at 90% often means nothing to users. Memory at 75% might be entirely normal. A pod restarting might be a routine liveness probe cycle. These alerts fire constantly, mean little individually, and train engineers to ignore them.
Symptom-based alerts are grounded in user experience. Error rate above 1% for five minutes means users are failing. p95 latency above 500ms means users are waiting. An SLO burn rate that will exhaust the error budget in two hours means reliability is degrading faster than acceptable. Checkout success rate below 99% means revenue is being lost right now.
The practical rule: if an alert fires and everything is working fine for users, it's a cause-based alert dressed as a symptom. Move it to warning — a Slack channel notification, not a page. Cause-based metrics are invaluable diagnostic tools during an incident, but they belong on dashboards, not pagers.
The test for any alert you're considering: if this fires at 2am and the engineer investigates and finds nothing wrong for users, what was the point? If the answer is "not much," the alert isn't ready to page. It might be useful as a dashboard indicator or a ticket trigger, but not as a pager alert.
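To make "error rate above 1% for five minutes" concrete, here is a minimal Python sketch of a sustained symptom check. The `Sample` type, the one-minute sample granularity, and the function name are assumptions for illustration; a real stack would express this in its alerting rule language rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int    # minutes since the start of the series
    requests: int  # total requests observed in that minute
    errors: int    # failed requests observed in that minute


def error_rate_breached(samples: list[Sample], threshold: float = 0.01,
                        for_minutes: int = 5) -> bool:
    """True if the error rate exceeded `threshold` in each of the last
    `for_minutes` one-minute samples."""
    recent = samples[-for_minutes:]
    if len(recent) < for_minutes:
        return False  # not enough data to call the condition sustained
    return all(s.requests > 0 and s.errors / s.requests > threshold
               for s in recent)
```

Requiring every recent sample to breach the threshold is what keeps a one-minute blip from paging anyone.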
The three requirements every alert must satisfy
Before adding any alert to the paging rotation, it should satisfy all three of these. If it fails any one, it isn't ready.
Actionable. There is a specific, documented response. "Check the runbook" is a response, but only if the runbook exists, is current, and actually tells the engineer what to do. An alert that fires and leaves the engineer staring at dashboards with no guidance isn't actionable — it's a notification that something might be wrong.
Urgent. It genuinely cannot wait until the morning. If a reasonable person would say "we can deal with this at 9am," it shouldn't be a page. It should be a ticket created automatically, or a Slack notification, or an email. The threshold for waking someone up should be high.
Accurate. It fires when something is wrong and reliably doesn't fire when everything is fine. An alert that fires three times a week without a real problem is an alert that will be ignored by the fourth week. Alert accuracy is not a nice-to-have — it's the property that determines whether the alert system is trusted.
The practical implication of these three requirements is that most alerts on most teams don't satisfy all three. Running an audit against these criteria (for every alert: is it actionable, is it urgent, is it accurate?) typically results in removing or demoting the majority of existing alerts. That's the right outcome.
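The three requirements can be turned into a simple routing decision. The sketch below is one possible policy, not a standard: the `Route` values and the choice of where each failing alert goes are assumptions a team would adjust.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    CHANNEL = "channel"
    REMOVE = "remove"


@dataclass
class AlertReview:
    name: str
    actionable: bool  # a current runbook with a specific response exists
    urgent: bool      # genuinely cannot wait until the morning
    accurate: bool    # reliably quiet when everything is fine


def route_for(review: AlertReview) -> Route:
    # Assumed policy: an inaccurate alert is removed (or reworked) outright.
    if not review.accurate:
        return Route.REMOVE
    # Accurate but no documented response: keep it visible, don't page.
    if not review.actionable:
        return Route.CHANNEL
    # Accurate and actionable, but it can wait: make it a ticket.
    if not review.urgent:
        return Route.TICKET
    return Route.PAGE
```

Only an alert that passes all three checks reaches the pager; everything else lands somewhere quieter.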
SLO burn rate alerting: the highest signal-to-noise ratio available
SLO burn rate alerting is the most advanced and most reliable alerting strategy available, and it flows directly from the SLOs defined in Post 13.
Instead of alerting on raw thresholds ("error rate above 1%"), you alert on how fast you're consuming your error budget. A burn rate of 1× means you'll exactly exhaust the budget at the end of the SLO window: sustainable, but with no margin. A burn rate of 14.4× means you'll exhaust a 30-day budget in about 50 hours, having spent 2% of it in the first hour alone. Page someone now.
The key insight: the burn rate number makes urgency explicit and comparable across services. You don't need to calibrate different thresholds for different services. The same burn rate thresholds — fast burn page, slow burn ticket — apply consistently, and the error budget automatically calibrates them to each service's SLO.
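The arithmetic behind a burn rate is short enough to write down. In this sketch the burn rate is the observed error rate divided by the error budget (1 minus the SLO target), and time to exhaustion is the SLO window divided by the burn rate. The function names and the 30-day default are illustrative assumptions.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate
```

For a 99.9% SLO, a 1.44% error rate gives a burn rate of about 14.4, and `hours_to_exhaustion(14.4)` comes out at roughly 50 hours. The same code applied to a 99% SLO needs a 14.4% error rate to reach that burn rate, which is how the budget calibrates one set of thresholds to each service.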
Multi-window burn rate alerts are the practical implementation that reduces false positives. Each alert evaluates two time windows simultaneously: a long window (1 hour for a page, 6 hours for a ticket) shows that a meaningful share of the budget has actually been spent, and a short window (5 or 30 minutes) confirms the burn is still happening. Alert only when both windows show elevated burn rate. A traffic spike that raises the error rate for two minutes fires the short window but not the long one: no page. A sustained 2% error rate fires both windows over time: page.
The Google SRE Workbook's recommended thresholds for a 99.9% SLO:
Page immediately: burn rate above 14.4× on both a 1-hour and 5-minute window
Create a ticket: burn rate above 6× on both a 6-hour and 30-minute window
No alert: everything below those thresholds is within budget
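These numbers look arbitrary until you work through the math. A 30-day window is 720 hours, so a 14.4× burn rate exhausts the budget in 720 / 14.4 = 50 hours and spends 2% of it every hour. The page fires while roughly two days of budget remain at the current pace, giving you time to fix the problem before users start experiencing significant reliability degradation. The 6× ticket threshold exhausts the budget in 120 hours (five days), which can wait for working hours.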
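The thresholds above can be sketched as a small decision function. The `BurnRule` type, the window labels used as dictionary keys, and `decide` are illustrative names, and the burn rates per window are assumed to be computed elsewhere. `threshold_for` shows where the numbers come from: 14.4 corresponds to spending 2% of a 30-day budget in 1 hour, and 6 to spending 5% in 6 hours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    action: str
    threshold: float
    long_window: str
    short_window: str


# Ordered from most to least urgent; the first matching rule wins.
RULES = (
    BurnRule("page", 14.4, "1h", "5m"),
    BurnRule("ticket", 6.0, "6h", "30m"),
)


def threshold_for(budget_fraction: float, window_hours: float,
                  slo_period_hours: float = 720.0) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def decide(burn_by_window: dict[str, float]) -> str:
    """Return 'page', 'ticket', or 'none' given burn rates keyed by window."""
    for rule in RULES:
        long_burn = burn_by_window.get(rule.long_window, 0.0)
        short_burn = burn_by_window.get(rule.short_window, 0.0)
        # Both windows must be elevated, which filters out brief spikes.
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.action
    return "none"
```

A brief spike such as `decide({"5m": 30.0, "1h": 2.0})` returns `"none"` because the long window never crosses the threshold, while `decide({"5m": 20.0, "1h": 20.0})` returns `"page"`.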
Alert anatomy: what every notification must contain
When an alert fires at 2am, the engineer who receives it is tired, potentially not deeply familiar with the affected system, and needs to make a decision quickly. The notification should give them everything they need to do that without hunting for context.
Every alert notification should contain:
What is broken — service name and the symptom. "Payment service — error rate elevated" not "Alert fired."
How bad it is — current value versus threshold. "Error rate 4.2% (threshold: 1%)" not just "threshold exceeded."
For how long — duration of the condition. "For 8 minutes" tells you whether this is a brief spike or a sustained problem.
Link to the relevant dashboard — one click to the monitoring view scoped to this service and this time window. Not the root of Grafana. The specific dashboard.
Link to the runbook — one click to the documented response procedure for this specific alert. The runbook should already exist before the alert is in rotation.
Teams that implement this anatomy consistently tend to acknowledge and diagnose faster, because the engineer doesn't have to look elsewhere for context: it's in the notification. They don't start the incident by hunting; they start it by reading the alert.
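One way to enforce this anatomy is to make the notification a structure that cannot be rendered without its required parts. The sketch below is illustrative only: the `AlertNotification` class, its field names, and the message layout are assumptions, not any paging tool's format.

```python
from dataclasses import dataclass


@dataclass
class AlertNotification:
    service: str
    symptom: str
    current_value: float
    threshold: float
    unit: str
    duration_minutes: int
    dashboard_url: str  # scoped to this service and time window
    runbook_url: str    # the runbook for this specific alert

    def render(self) -> str:
        # Refuse to send a page that would leave the engineer hunting for context.
        if not self.dashboard_url or not self.runbook_url:
            raise ValueError("alert needs both a dashboard link and a runbook link")
        return "\n".join([
            f"{self.service}: {self.symptom}",
            f"Now {self.current_value:g}{self.unit} "
            f"(threshold: {self.threshold:g}{self.unit})",
            f"For {self.duration_minutes} minutes",
            f"Dashboard: {self.dashboard_url}",
            f"Runbook: {self.runbook_url}",
        ])
```

Raising on a missing link is a deliberate choice here: an alert without a runbook fails at definition time instead of at 2am.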
Runbooks: what good looks like at 3am
A runbook is the document that tells an on-call engineer exactly what to do when a specific alert fires. The standard for a good runbook is this: give it to a competent engineer who has never seen the affected system before and see if they can follow it to resolution without asking anyone for help. Every place they get stuck is a gap.
The minimum contents of every runbook:
What the alert means in plain language
Immediate triage steps — is this actually affecting users? What is the scope of impact?
Diagnostic commands or queries to run, copy-pasteable
The most likely causes, ranked by probability from most to least common
Remediation steps for each likely cause
How to verify the issue is resolved
Escalation path if the standard steps don't resolve it
Rollback procedure if a recent deployment caused it
Link to the relevant dashboard and service documentation
Runbooks live in the repository. Not in a separate wiki that nobody updates, not in someone's personal notes, not in a Google Doc that's three years old. /docs/runbooks/ in the service repository, versioned alongside the code, linked directly from the alert definition. When code changes in a way that affects the operational procedure, the runbook PR and the code PR are in the same review.
The 3am test is the practical quality check. After writing a runbook, walk through it yourself at normal mental capacity. Then give it to the most junior engineer on the team and ask them to follow it. Every point of confusion is a runbook deficiency, not a deficiency in the engineer.
Runbooks go stale. Add a "last verified" date. Include a step in every post-mortem to review whether the runbook was helpful and update it based on what was learned. A runbook that accurately described the system two years ago may be dangerously wrong today.
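Because runbooks live in the repository, their completeness and freshness can be checked automatically. This is a rough sketch under stated assumptions: the section titles in `REQUIRED_SECTIONS` are hypothetical names for the minimum contents listed above, the check is a plain substring match, and the 180-day limit is an illustrative default.

```python
from datetime import date

# Hypothetical section titles mirroring the minimum runbook contents.
REQUIRED_SECTIONS = (
    "What this alert means",
    "Triage",
    "Diagnostics",
    "Likely causes",
    "Remediation",
    "Verification",
    "Escalation",
    "Rollback",
    "Links",
)


def runbook_gaps(text: str, last_verified: date, today: date,
                 max_age_days: int = 180) -> list[str]:
    """List missing sections and flag a stale 'last verified' date."""
    lowered = text.lower()
    gaps = [f"missing section: {section}"
            for section in REQUIRED_SECTIONS
            if section.lower() not in lowered]
    age = (today - last_verified).days
    if age > max_age_days:
        gaps.append(f"last verified {age} days ago (limit {max_age_days})")
    return gaps
```

A check like this could run in CI against `/docs/runbooks/`. It catches missing structure, not wrong content, so it complements the 3am test and does not replace it.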
Alert grouping and deduplication
A cascading failure can fire fifty alerts simultaneously — one per affected pod, one per affected endpoint, one per affected region. An engineer who receives fifty pages in thirty seconds while trying to respond to an incident has a coordination problem on top of a technical problem.
Alert grouping and deduplication is the practice of collapsing related alerts into a single incident notification. "Payment cluster degraded — 12 related alerts" is one page. "payment-pod-1 down, payment-pod-2 down, payment-pod-3 down..." is twelve pages for what is, from the engineer's perspective, one problem.
Alertmanager (the alert routing layer for Prometheus-based stacks) handles grouping natively. Group alerts by service and alert name. Set a group wait of 30 seconds — collect related alerts that fire together before sending a notification. Set a group interval of 5 minutes — collect additional alerts that fire for the same group during an active incident and send a summary rather than individual notifications.
PagerDuty and OpsGenie both support alert grouping rules at the incident level. Alerts that share a service, a region, or an alert type can be routed into a single incident rather than creating one incident per alert.
The configuration investment pays back on the first major incident. When everything is going wrong at once, the last thing the incident commander needs is fifty separate pages to acknowledge.
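The core idea of grouping, collapsing alerts that share a key into one notification, fits in a few lines. This Python sketch only illustrates the concept; it is not how Alertmanager or any paging tool is configured, and the label names `service` and `alertname` in the example data are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict[str, str]],
                 group_by: tuple[str, ...] = ("service", "alertname")) -> list[str]:
    """Collapse alerts sharing the same `group_by` labels into one summary each."""
    groups: dict[tuple[str, ...], list[dict[str, str]]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    # One notification per group, in the order the groups first appeared.
    return [f"{' / '.join(key)}: {len(members)} related alert(s)"
            for key, members in groups.items()]
```

Twelve `PodDown` alerts for the payment service become a single line, `payment / PodDown: 12 related alert(s)`, instead of twelve pages.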
On-call structure: rotation design that doesn't destroy people
On-call is a cost that falls on individuals. Managed poorly, it causes burnout, attrition, and a culture of dread around production. Managed well, it's a reasonable, compensated part of owning a system — with clear boundaries, fair rotation, and support structures.
Rotation design. Primary on-call plus a secondary. The primary responds first; the secondary is the escalation path and backup if the primary is unreachable. Rotate weekly — not monthly (too long, responders carry the burden for extended periods) or daily (too disruptive, constant context switching). The minimum team size for a sustainable rotation is four to five engineers. With fewer, someone is on-call more than once every three to four weeks, which means they're effectively never fully off.
Response time expectations. Defined and documented, not assumed. SEV1: acknowledge within five minutes, first update within fifteen. SEV2: acknowledge within fifteen minutes. SEV3: next business day. These are commitments the team makes collectively, not suggestions. If response times aren't being met, investigate whether the rotation is understaffed or the alert volume is too high — not whether engineers need to be more responsive.
Compensation. Being on-call has a cost even when nothing fires — the cognitive burden of being interruptible affects sleep quality, weekend plans, and focus. This should be acknowledged and compensated: explicit on-call pay, time off in lieu for nighttime incidents, or both. Teams that treat on-call as "part of the job" without compensation find it difficult to maintain rotation participation and suffer the downstream effects on retention.
Handoff procedures. At rotation transition, document open incidents, known fragile systems, recent deployments that need watching, and anything unusual that the incoming engineer needs to know. A fifteen-minute sync between outgoing and incoming on-call at handoff prevents the incoming engineer from being blindsided in their first hour. Write it down — don't rely on a verbal briefing that happens once and is immediately forgotten.
Toil measurement. Track on-call load: pages per week, time spent per incident, percentage of incidents that required manual intervention versus resolved automatically. If a team is spending more than 20–25% of their time on operational work — toil — that's a systemic problem. The answer is automation, better alerting, or architectural fixes. Not "work harder."
The quarterly alert audit
Alert systems accumulate. Every incident generates a suggestion to add an alert. Every monitoring review surfaces more things to watch. Left unmanaged, the alert system grows continuously and the signal-to-noise ratio degrades continuously.
A quarterly audit — taking two hours, reviewing every alert against a consistent set of criteria — prevents this accumulation from compounding.
For every alert in the system, answer:
Does this alert have a runbook? If not, either write the runbook before the next quarter or remove the alert from the paging rotation.
Did this alert fire in the last 90 days? If not, it's either never-firing (no signal value) or it was fixed and nobody cleaned it up. Disable it and archive it. If it was important, the next incident will surface it again.
How many times did it fire and what percentage required action? An alert that fires eight times a week and requires action twice is 75% noise. Raise the threshold, add a longer evaluation window, or demote it to a ticket.
Does it fire outside business hours? If it pages at 2am and is not SEV1 or SEV2, route it to a ticket instead. The severity classification should determine the routing, not the alert name.
Has the runbook been updated in the last six months? If not, schedule a runbook review.
Can a new team member follow the runbook without help? Test this. The answer is usually no until it's been tested.
The goal of this audit: by the end, every alert that pages someone does so for a real reason, with a documented response procedure, that has been validated to actually indicate a user-facing problem. An alert system that passes this audit is one the team can trust — and a team that trusts its alerts responds to them.
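Several of the audit questions are mechanical enough to script against exported alert history. In this sketch the `AlertStats` fields and the 50% noise cutoff are assumptions; the runbook-quality questions still need a human.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    has_runbook: bool
    fires_90d: int            # times the alert fired in the last 90 days
    actioned_90d: int         # fires that required someone to act
    pages_out_of_hours: bool
    severity: str             # "SEV1", "SEV2", or "SEV3"


def audit(alert: AlertStats, max_noise: float = 0.5) -> list[str]:
    """Return the audit findings for one alert; an empty list means it passes."""
    findings: list[str] = []
    if not alert.has_runbook:
        findings.append("write a runbook or remove from the paging rotation")
    if alert.fires_90d == 0:
        findings.append("no fires in 90 days: disable and archive")
        return findings
    noise = 1 - alert.actioned_90d / alert.fires_90d
    if noise > max_noise:
        findings.append(f"{noise:.0%} noise: raise threshold, "
                        "lengthen window, or demote to ticket")
    if alert.pages_out_of_hours and alert.severity not in ("SEV1", "SEV2"):
        findings.append("route to a ticket outside business hours")
    return findings
```

An alert with eight fires and two that needed action is reported as 75% noise, matching the arithmetic in the question above.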
What goes wrong when alerting is broken
Alert fatigue compounds. Once engineers start ignoring alerts because most are noise, they start ignoring real ones too. The trained response to a page becomes "probably nothing" rather than "something is wrong." Trust erodes faster than it can be rebuilt: each false positive makes the next real alert slightly less likely to get the attention it deserves.
The cause-based page. CPU is high. Someone gets paged. They investigate. Nothing is wrong for users. They go back to sleep. This repeats until the engineer starts adding --no-pager to every response. The fundamental issue — alerting on causes rather than symptoms — never gets fixed because nobody has time to fix alerting when they're busy being paged.
The missing runbook. An alert fires for a service that three people on the team deeply understand and everyone else has never touched. The on-call engineer that week is not one of the three. They receive a page, open the alert, find no runbook, and spend forty minutes in a Slack channel trying to find someone who knows the system. The runbook gap directly extended the incident.
The solo rotation. A team of two splitting on-call means one person is always on-call — there's no "off." This is not sustainable over months. Engineers in this situation gradually develop a persistent anxiety about production that affects everything else they do. If the team is too small for a healthy rotation, that's a hiring conversation.
If you do one thing from this post
Run the alert audit on one service. Pick the service that generates the most pages and go through every alert it has. For each one, ask: is it actionable, is it urgent, is it accurate? Remove the ones that fail. Add runbooks to the ones that don't have them. Raise the thresholds on the ones that fire too often.
Don't do this alone — do it with the team, in a meeting, with the on-call rotation present. The people who get paged at 2am know which alerts are noise and which ones matter. That knowledge is worth surfacing and acting on.
Next up: Post 15 — Blameless Post-Mortems: How to Turn Outages Into the Best Learning Your Team Gets



