The instinct is hardwired into every engineer who’s shipped production code: if the call fails, try again. It feels responsible — a small buffer against network chaos and flaky backends. But that instinct, unchecked, is how you turn a recoverable hiccup into a four-hour outage that gets the CTO on Slack asking what the hell happened.
I’ve been in the war room when it happens. A service stumbles — maybe a deployment didn’t fully bake, maybe the database hit some lock contention — and suddenly every client in the datacenter decides now is the time to demonstrate grit. What was a localized wobble becomes a stampede. The service that was successfully handling 80% of requests gets buried under 300% of normal traffic, nearly all of it retries. Recovery becomes impossible. The system just thrashes, burning CPU to accomplish nothing.
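The feedback loop is easy to see in a toy simulation (the numbers here are hypothetical, chosen only to mirror the scenario above): a service with fixed capacity, clients that immediately retry every failure, and no backoff or retry cap. Each tick, the failures from the last tick pile on top of new traffic.

```python
# Toy retry-storm simulation (illustrative numbers, not measured data).
# A service with fixed capacity serves what it can each tick; every
# failed request is retried on the next tick, on top of new demand.

def simulate(capacity: float, demand: float, ticks: int) -> list[float]:
    """Return the offered load per tick when every failure retries immediately."""
    pending = 0.0  # failed requests carried over as retries
    loads = []
    for _ in range(ticks):
        offered = demand + pending   # new traffic plus last tick's retries
        loads.append(offered)
        served = min(offered, capacity)
        pending = offered - served   # no backoff, no retry budget
    return loads

# Capacity 80 req/s, steady demand 100 req/s: a modest 20% shortfall.
loads = simulate(capacity=80, demand=100, ticks=10)
print(loads)  # → [100.0, 120.0, 140.0, ..., 280.0]
```

Even though the service never degrades further, offered load climbs by the full shortfall every tick, because retries are indistinguishable from new work. After ten ticks the service faces nearly three times its capacity, and that is the optimistic case where failed requests wait a full tick instead of hammering back instantly.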