10 Hard-Won Lessons from 5 Years On-Call

Being on-call teaches you things no classroom or certification ever could. After five years of being paged at 3 AM, handling multi-region outages, and writing more post-mortems than I can count, certain patterns keep surfacing. These aren't abstract principles — they're lessons I re-learn the hard way every time I ignore them.

1. Your runbooks are lying to you

The documentation you wrote six months ago is already out of date. Services evolve, owners change, and nobody updates the runbook. The moment you're fumbling through a broken runbook during a P1 is the wrong time to discover this.

Fix: Build runbook verification into your change process. Every deploy that alters behavior should touch the runbook. Better yet, make runbooks executable — a script that runs the diagnostic steps is harder to leave stale than a paragraph of prose.
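Here is a minimal sketch of what an executable runbook could look like, assuming diagnostics that can be expressed as small Python checks. The function names, thresholds, and the hostname in the usage comment are all hypothetical placeholders, not taken from any real service.

```python
import shutil
import socket
from typing import Callable, List, Tuple

# A step is (description, check). Each check returns (ok, detail).
Step = Tuple[str, Callable[[], Tuple[bool, str]]]


def disk_has_free_space(path: str = "/", min_free_fraction: float = 0.10) -> Tuple[bool, str]:
    """Check that at least min_free_fraction of the disk at `path` is free."""
    usage = shutil.disk_usage(path)
    free = usage.free / usage.total
    return free >= min_free_fraction, f"{free:.0%} free on {path}"


def port_is_open(host: str, port: int, timeout: float = 2.0) -> Tuple[bool, str]:
    """Check that a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} reachable"
    except OSError as exc:
        return False, f"{host}:{port} unreachable ({exc})"


def run_runbook(steps: List[Step]) -> bool:
    """Run every step, print one line per step, return True only if all pass."""
    all_ok = True
    for description, check in steps:
        try:
            ok, detail = check()
        except Exception as exc:  # a crashing step is itself a finding
            ok, detail = False, f"step raised {exc!r}"
        print(f"[{'OK' if ok else 'FAIL'}] {description}: {detail}")
        all_ok = all_ok and ok
    return all_ok


# Example usage (hypothetical hostname):
# run_runbook([
#     ("Root disk has at least 10% free", disk_has_free_space),
#     ("Database port answers", lambda: port_is_open("db.internal.example", 5432)),
# ])
```

Running a script like this in CI on every deploy is one way to notice a stale step before the next P1 does.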

2. The alert that fires most is the one you trust least

Alert fatigue is real and it's insidious. When PagerDuty fires 40 times on a Tuesday for a "low disk" alert that has never once led to a real incident, you stop taking it seriously. Then one day it is a real incident, and you sleep through it.

Audit your alerts quarterly. If an alert hasn't led to meaningful action in 90 days, silence it or fix the underlying issue. An alert with no action is just noise with an SLA attached.
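The quarterly audit can be sketched in a few lines, assuming you can export per-alert fire counts and the last time someone took meaningful action. The `AlertStats` record and its fields are my own illustration, not any paging tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertStats:
    name: str
    fire_count: int                    # times the alert fired in the review window
    last_actioned: Optional[datetime]  # last time a human took meaningful action


def stale_alerts(alerts: List[AlertStats], now: datetime, max_age_days: int = 90) -> List[AlertStats]:
    """Return alerts that fired but led to no action within max_age_days,
    noisiest first."""
    cutoff = now - timedelta(days=max_age_days)
    stale = [
        a for a in alerts
        if a.fire_count > 0 and (a.last_actioned is None or a.last_actioned < cutoff)
    ]
    return sorted(stale, key=lambda a: a.fire_count, reverse=True)
```

Sorting by fire count puts the worst offenders at the top of the review list, which is usually where the "low disk" style alerts end up.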

3. Correlation is not causation, but it's a damn good starting point

You deployed at 14:00 and errors started at 14:04. That's suspicious but not a verdict. I've wasted hours chasing a deploy that turned out to be coincidental while the actual cause was a third-party API silently rate-limiting us.

Correlate fast, but stay open. Build your incident investigation as a tree, not a tunnel.
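One way to keep the investigation a tree is to list every change near the error onset, not just the deploy you already suspect. The sketch below assumes you can gather change events (deploys, config pushes, vendor notices) into one list; `ChangeEvent` and the 30-minute lookback are hypothetical choices.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class ChangeEvent(NamedTuple):
    when: datetime
    kind: str      # e.g. "deploy", "config", "vendor"
    summary: str


def candidate_causes(
    events: List[ChangeEvent],
    error_start: datetime,
    lookback: timedelta = timedelta(minutes=30),
) -> List[ChangeEvent]:
    """Return every event in the lookback window before error_start,
    most recent first. These are leads to check, not verdicts."""
    window_start = error_start - lookback
    in_window = [e for e in events if window_start <= e.when <= error_start]
    return sorted(in_window, key=lambda e: e.when, reverse=True)
```

The most recent change comes first because it is the cheapest to check, but the rest of the list stays visible so you don't tunnel on it.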

4. Communicate before you understand

In the first 5 minutes of an incident, you don't know what's happening. That's fine; nobody expects you to. What they do expect is acknowledgment. A quick "I'm investigating, will update in 10 min" posted to the status page and your incident channel buys you time and stops executives from pinging you directly.

The worst incident communication is silence.

5. The blast radius is always bigger than you think

"This only affects users in us-east-1 with 5-year-old accounts." Famous last words. Dependencies are a web, not a line. Before you declare scope, map your blast radius on a whiteboard. You'll almost always find an edge case you missed.

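The whiteboard exercise for blast radius can also be sketched as a graph walk, assuming you have a map of which service depends on which. The service names in the usage note are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`.

    depends_on maps a service to the services it calls.
    """
    # Invert the graph: for each dependency, who relies on it?
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected and service != failed:
                affected.add(service)
                queue.append(service)
    return affected
```

With a hypothetical map where `checkout` calls `payments` and `auth`, both of which call `auth-cache`, a failure in `auth-cache` reaches `checkout` and anything that calls it, even though nothing lists that dependency directly.
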
6. Rollback is almost always faster than fix-forward

I know you want to fix it. I know the bug is obvious. Roll back anyway: roll back first, fix second. A rollback takes about 2 minutes; even a "simple" fix takes 20 because you have to write it, review it, test it, deploy it, and verify it, all under stress, with people watching.

The only exception is when rollback itself poses risk (irreversible data migrations, etc.). And if that's the case, you should have planned for it.

7. Your monitoring doesn't know what it doesn't know

"The metrics look fine" is not the same as "the system is healthy." Black-box monitoring tells you that requests are completing — it doesn't tell you they're completing with corrupted data, or that they're slow for 10% of users in a specific region, or that the cache is serving stale content from three hours ago.

Invest in synthetic monitoring and business-level metrics. Measure what users actually experience, not just what's easy to instrument.
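A synthetic check that looks past the status code might be sketched like this. It assumes the endpoint returns JSON with a `generated_at` Unix timestamp, which is a made-up field for illustration; `fetch` is any callable you supply that returns `(status_code, body)`.

```python
import json
import time
from typing import Callable, List, Optional, Tuple

# fetch(url) -> (status_code, body). Supplied by the caller (hypothetical).
Fetch = Callable[[str], Tuple[int, str]]


def synthetic_check(
    fetch: Fetch,
    url: str,
    max_latency_s: float = 1.0,
    max_staleness_s: float = 300.0,
    now: Optional[float] = None,
) -> List[str]:
    """Probe url and return a list of problems; an empty list means healthy."""
    problems: List[str] = []

    started = time.monotonic()
    status, body = fetch(url)
    elapsed = time.monotonic() - started

    if status != 200:
        problems.append(f"status {status}")
    if elapsed > max_latency_s:
        problems.append(f"slow response: {elapsed:.2f}s")

    # A 200 with a broken body is still a failure from the user's side.
    try:
        payload = json.loads(body)
    except ValueError:
        problems.append("body is not valid JSON")
        return problems

    # Assumed field: Unix timestamp of when the content was generated.
    generated_at = payload.get("generated_at") if isinstance(payload, dict) else None
    if not isinstance(generated_at, (int, float)):
        problems.append("missing generated_at")
        return problems

    age = (time.time() if now is None else now) - generated_at
    if age > max_staleness_s:
        problems.append(f"content is stale by {age:.0f}s")
    return problems
```

Passing `fetch` in as a parameter keeps the check testable without a network, and lets the same logic run from several regions with different clients.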

8. Post-mortems without action items are just venting

I've been in post-mortems that produced beautiful writing, thorough timelines, and zero follow-through. The meeting felt productive. Nothing changed. Six months later, the same failure mode bit us again.

Make action items the most important artifact of your post-mortem. Assign owners, set due dates, track them in your issue tracker. The writing is for learning; the action items are for improvement.
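Tracking can be as simple as a periodic check for items that have no owner, no due date, or a due date in the past. This is a sketch with a hypothetical `ActionItem` record; in practice the data would come from whatever issue tracker you use.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def needs_attention(items: List[ActionItem], today: date) -> List[Tuple[str, str]]:
    """Return (title, reason) for every open item that is unowned,
    undated, or overdue."""
    flagged: List[Tuple[str, str]] = []
    for item in items:
        if item.done:
            continue
        if item.owner is None:
            flagged.append((item.title, "no owner"))
        elif item.due is None:
            flagged.append((item.title, "no due date"))
        elif item.due < today:
            flagged.append((item.title, f"overdue since {item.due.isoformat()}"))
    return flagged
```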

9. Incident command is a skill, not a role

When everything is on fire, someone needs to coordinate: not diagnose, not fix, but coordinate. Keep people from stepping on each other. Maintain a coherent picture of what's known and unknown. Communicate outward. This is incident command, and it is a completely different job from the engineering work happening in parallel.

The best engineers I know actively resist the urge to jump into fixing mode when they should be commanding. It takes discipline.

10. Your future self will be paged on this code

I write better runbooks, better alerts, and better logs when I remember that the next person on-call for this service might be me at 3 AM in six months. This reframing is the cheapest reliability improvement available.

Build the observability you'd want to have during an incident. Write the documentation you'd want to find. Leave breadcrumbs.

On-call is hard. But the teams that get good at it don't just survive incidents — they build systems that have fewer of them. That's the real goal.