In high-performance computing, we are trained to hunt for bottlenecks in our code, our algorithms, or our infrastructure. But my favorite bug was in none of those. It was an invisible interaction between the JVM’s garbage collector and the server’s disk, and it caused stop-the-world (STW) pauses of 15 seconds or more on a service handling millions of requests per second.
The Mystery: The 503 Spikes
I was working on a large-scale Java service designed for extreme throughput. Despite that design, we were plagued by intermittent spikes in load balancer timeouts, which reached users as 503 responses.