The Contradiction
At 3:47 AM, your monitoring dashboard shows a healthy Kubernetes cluster — 99.97% availability. Your customers report a complete outage.
Ninety seconds later, the pod has self-healed. Metrics look normal. The restart counter reads “1.” But why it restarted — what actually happened — has vanished. This isn’t a tooling failure. The system simply recovered faster than a human could observe.