The dashboard turned red in the middle of a weekday. Our order-processing API latency jumped from fifty milliseconds to five seconds. Customer support tickets flooded in as users reported timeouts during checkout. The infrastructure team scaled up the Kubernetes pods, but the issue persisted: CPU usage sat at 100 percent across every node. We were throwing hardware at a software problem, and it failed miserably.
In this article, I will share how we diagnosed the bottleneck, the profiling tools we used, and the code changes that restored performance. This is not a theoretical guide; it is a record of a real production incident and the steps we took to resolve it.