A 60-hour training job had become the new normal. GPUs were saturated, data pipelines looked healthy, and infrastructure monitoring didn’t flag any issues. But something was off. The model wasn’t large, and the data wasn’t complex enough to justify that duration. The cause, when we eventually found it, wasn’t in the Python code or the model definition. It was buried deep in the compiler stack.
Identifying the Invisible Bottleneck
