Thu. Mar 5th, 2026

Autoscaling Is Not Elasticity


The first autoscaling incident I handled personally left me staring at CloudWatch graphs at 3 AM, watching our infrastructure commit suicide by optimization. The Auto Scaling Group was doing exactly what we’d configured it to do — launching EC2 instances to meet target capacity — except AWS’s control plane was choking on an internal partition failure, and every instance we launched hung in pending state, timed out health checks, got terminated, triggered a replacement launch. Over and over. We’d built a perfect feedback loop of failure, and the irony was we’d followed the documentation to the letter.

The fix took four minutes once we understood what was happening: manually pin the ASG at current capacity, stop all scaling activity, let the control plane recover on its own schedule. But we didn’t have a procedure for that. No Terraform config sitting in version control with min_size = desired = max_size = 12. No runbook that said “when AWS is on fire, turn off your autoscaler first.” We improvised it in the AWS console while our service degraded, and afterward we wrote it down in a document we titled “The Red Button.”

By uttu

Related Post

Leave a Reply

Your email address will not be published. Required fields are marked *