The incident had been running for forty-seven minutes when I watched the on-call engineer open his sixth browser tab. Grafana for the infrastructure metrics. Splunk for the application logs. A separate Jaeger instance — legacy, running on a server that was itself poorly monitored — for traces from the API layer. A custom dashboard someone had built in Kibana eighteen months earlier for the payment service, which used a different logging format than everything else. And a Datadog trial that a team had spun up six weeks prior for a new microservice, not yet integrated with anything.
He wasn’t incompetent. He was experienced, methodical, and clearly doing his best under pressure. The problem was that the answer — a cascade that had started when a downstream dependency began timing out under load, causing queue depth to grow on a service that nobody had instrumented with queue metrics — was distributed across four systems that had no awareness of each other. He had to hold the context in his head. Manually. While an incident was live.