The Invisible Ceilings of Cloud Infrastructure
What Pinterest’s EC2 throttling incident—and others like it—teach us about building resilient systems in a black box world.
When the Cloud Slows Down, Quietly
It started with user complaints.
Pages were loading slowly. API calls were timing out. Internal dashboards lagged behind real-time. Something was wrong—but none of the usual suspects showed up.
No alerts fired. No dashboards lit up. And yet the system was clearly sick.
Engineers dug in but couldn’t find the cause. CPU? Normal. Memory? Fine. Logs? Clean. The system looked healthy. It just wasn’t behaving that way.
It was as if the infrastructure itself had hit a patch of black ice—nothing visibly wrong, yet everything more slippery than it should be.
Just a creeping, quiet slowness.
And that’s the danger.
We’ve built layers of tooling to catch crashes, errors, and spikes. But we’ve left ourselves exposed to something harder to catch: degradation. Not failure, just… less.
Pinterest isn’t alone. AWS EBS had its own silent meltdown in 2011. Kubernetes throttles CPU under the radar. GCP and Azure enforce bandwidth ceilings that most teams only discover the hard way.
This post digs into what happened to Pinterest, how they uncovered it, and how it connects to a wider pattern of silent failure in cloud infrastructure. If you think your dashboards are enough, read on.
The Cliff – When Observability Fails
It started with a feeling. Something wasn’t right. Services were getting slower. But nothing was obviously broken. CPU was fine. Memory was fine. Disk was fine. Logs? Clean.
But users noticed. Engineers noticed. It was like walking on soft ground—stable enough to stand, but clearly shifting underneath.
This is what failure looks like in the cloud: not a crash, but a soft cliff.
Observability, in theory, should have caught this. But observability is a flashlight—you only see what you’re aiming at. And Pinterest wasn’t watching for ENI-level bandwidth limits. Why would they be?
Traditional metrics are biased. We measure what’s easy, what’s visible, what the provider exposes. But most cloud failures aren’t loud—they’re invisible ceilings you only discover when you hit them.
Diagnosing the Ceiling – The Hunt and the Hard Lesson
With no clear signals, Pinterest engineers had to look deeper. Dashboards gave way to kernel counters. Tools like ethtool, nstat, and /proc/net/dev offered a view into the system’s internals.
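If you have never gone digging there, the check is less exotic than it sounds. Here is a minimal sketch, assuming a Linux host with an AWS ENA-backed interface; the allowance-exceeded counter names come from the ENA driver and can vary by driver version:

```python
import subprocess

def read_allowance_counters(interface: str = "eth0") -> dict:
    """Parse `ethtool -S` output and keep only the ENA allowance-exceeded
    counters: packets shaped or dropped because an instance-level limit was hit."""
    out = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    ).stdout

    counters = {}
    for line in out.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        # Counter names exposed by the ENA driver; adjust for your driver/version.
        if name.endswith("allowance_exceeded"):
            counters[name] = int(value.strip())
    return counters

if __name__ == "__main__":
    for name, value in read_allowance_counters().items():
        print(f"{name}: {value}")
```

If counters like these climb while every dashboard stays green, you have found your ceiling.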
There it was: bandwidth throttling. No errors. No metrics. Just silent enforcement.
They had crossed the token bucket limits AWS applies to EC2 network traffic. Too much traffic, too fast, and the bucket ran dry. Packets dropped. Latency climbed.
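The mechanics are worth seeing once. Here is a toy token bucket, with made-up numbers rather than AWS’s actual allowances, showing how sustained traffic above the refill rate eventually empties the bucket:

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second (sustained allowance)
    burst: float      # bucket capacity (how much burst is tolerated)
    tokens: float = 0.0

    def __post_init__(self):
        self.tokens = self.burst  # start with a full bucket

    def send(self, cost: float, elapsed: float) -> bool:
        """Refill for `elapsed` seconds, then try to spend `cost` tokens.
        Returns False when the bucket is empty: the packet is shaped or dropped."""
        self.tokens = min(self.burst, self.tokens + self.rate * elapsed)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical numbers: refill 100 tokens/s, bucket holds 500,
# but traffic arrives at 150 "packets" per second.
bucket = TokenBucket(rate=100.0, burst=500.0)
dropped = 0
for second in range(20):
    for _ in range(150):
        if not bucket.send(cost=1.0, elapsed=1.0 / 150):
            dropped += 1
print(f"dropped: {dropped}")
```

A bigger burst capacity only delays the moment the bucket runs dry; it does nothing for the sustained rate. That detail matters in the next few paragraphs.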
Synthetic traffic helped confirm it. The platform was behaving as designed—just not as expected.
So the team scaled up. More cores, more bandwidth—problem solved?
Not quite.
Bigger instances meant bigger bursts. The token bucket drained faster. Throttling hit harder.
So they flipped the script. Smaller instances. More ENIs. Shaped traffic. Spread the load horizontally.
And it worked.
Throughput stabilised. Latency dropped. The system didn’t need more muscle—it needed more agility.
It’s counterintuitive but common: in the cloud, scaling up can push you closer to invisible thresholds. Bigger can be burstier, and burstier gets punished.
We’ve seen it with AWS API Gateway too. Hit the rate limit and requests start coming back as 429s. It doesn’t matter how big you are, only how fast you hit.
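The defensive pattern is the same wherever a quota hides: treat throttling as an expected response and back off with jitter instead of hammering the limit. A rough sketch using the requests library against a hypothetical endpoint:

```python
import random
import time

import requests

def call_with_backoff(url: str, max_attempts: int = 5) -> requests.Response:
    """Retry on HTTP 429 (throttled) with capped, jittered exponential backoff."""
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=5)
        if response.status_code != 429:
            return response
        # Honour Retry-After when it arrives as seconds; otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else min(30, 2 ** attempt)
        time.sleep(delay + random.uniform(0, 1))  # jitter breaks up synchronized retry storms
    raise RuntimeError(f"still throttled after {max_attempts} attempts: {url}")

# Usage (hypothetical endpoint):
# resp = call_with_backoff("https://api.example.com/v1/boards")
```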
The Pattern – Black Box Infrastructure Has Hidden Modes
Kubernetes has its own version of soft cliffs—CPU throttling that doesn’t show up until response times crawl. Systems degrade before they fail, and our tooling rarely catches the difference.
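Catching it means doing what Pinterest did: drop below the metrics layer and read the kernel’s own counters. A small sketch that pulls the CFS throttling stats from inside a container; the paths differ between cgroup v1 and v2, so it tries both:

```python
from pathlib import Path

# cgroup v2 exposes cpu.stat at the cgroup root mounted in the container;
# cgroup v1 keeps it under the cpu controller. Field names differ slightly.
CANDIDATES = [
    Path("/sys/fs/cgroup/cpu.stat"),       # cgroup v2: nr_throttled, throttled_usec
    Path("/sys/fs/cgroup/cpu/cpu.stat"),   # cgroup v1: nr_throttled, throttled_time (ns)
]

def read_cpu_throttling() -> dict:
    """Return raw CFS throttling counters, or an empty dict if none are found."""
    for path in CANDIDATES:
        if path.exists():
            stats = {}
            for line in path.read_text().splitlines():
                key, _, value = line.partition(" ")
                stats[key] = int(value)
            return stats
    return {}

if __name__ == "__main__":
    stats = read_cpu_throttling()
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    if periods:
        print(f"throttled in {throttled}/{periods} scheduling periods "
              f"({100 * throttled / periods:.1f}%)")
```

A container can be throttled in a large fraction of its scheduling periods while its average CPU utilisation still looks comfortable.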
In many ways, what Pinterest experienced wasn’t just a case of throttling. It was a textbook gray failure — the system didn’t crash. It just quietly degraded.
Microsoft Research coined this term in their 2017 paper, Gray Failure: The Achilles’ Heel of Cloud-Scale Systems. They found that many real-world failures weren’t binary. Systems didn’t break. They just misbehaved—slow disks, stuck CPUs, flaky NICs, uneven timeouts.
The key problem? Differential observability. One part of the system sees green. Another part sees red. Monitoring doesn’t catch it because each layer is telling a different story.
Gray failures whisper, they don’t shout.
And they’re dangerous—because they pass health checks, but still hurt performance and correctness.
Pinterest’s ENI throttling incident fits this perfectly: healthy-looking metrics masking degraded user experience. Retry storms, cold starts, network drops—all signs of partial failure modes that look “healthy” at the system level.
What Microsoft saw at scale is echoed here:
- Health checks pass, but disks slow down
- Threads get scheduled, but don’t run
- Packets get dropped silently
These aren’t outliers. They’re structural. And they demand we move past binary thinking.
We need to:
- Embrace multi-angle observability (not just metrics, but symptoms)
- Accept that healthy/unhealthy is too simplistic
- Simulate not just failure but weirdness (see the sketch below)
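That last point deserves an example. “Weirdness” here means a dependency that still answers, just slowly and unpredictably. A toy latency-injection wrapper (the probabilities and delays are arbitrary) makes the point:

```python
import functools
import random
import time

def inject_weirdness(p_slow: float = 0.05, delay_range=(0.2, 2.0)):
    """Decorator that randomly adds latency to a call. The call still succeeds,
    so binary health checks stay green while tail latency quietly degrades."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < p_slow:
                time.sleep(random.uniform(*delay_range))
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_weirdness(p_slow=0.1)
def fetch_profile(user_id: str) -> dict:
    # Stand-in for a real dependency call.
    return {"user_id": user_id}

# Run it under load and watch the p99, not the error rate:
# the failure mode is gray, not red.
```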
The cloud is a system built of abstractions. And those abstractions fail, not like light switches—but like people. Gradually, inconsistently, and quietly.
This wasn’t a bug. It was an architectural blind spot.
Cloud infrastructure is full of failure modes that hide in the shadows. Cold starts. Retry storms. Noisy neighbours. Secret limits. Resource contention. Scheduler starvation.
They don’t show up until you cross the line—and the line is often undocumented.
We assume observability equals visibility. But most tools only show what providers choose to expose. And platforms have incentives to simplify.
Behind the clean dashboards, you’re still riding on shared hardware. Shared networks. Shared assumptions.
Failure here doesn’t crash your app—it just slows it until someone notices.
Design for the Unknown Unknowns
Pinterest didn’t just fix the symptom. They changed the posture.
They started poking at the ceiling.
Simulated traffic. Spiky patterns. Failure injection. They wanted to know how and when the platform would push back.
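In spirit, that probing can be as simple as the sketch below: fire increasingly spiky bursts at a non-production endpoint and watch where the tail latency bends. The endpoint and burst sizes are placeholders, not Pinterest’s actual harness.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET = "https://staging.example.com/health"   # hypothetical, non-production endpoint

def timed_get(url: str) -> float:
    start = time.perf_counter()
    requests.get(url, timeout=10)
    return time.perf_counter() - start

def probe(burst_sizes=(10, 50, 100, 200)) -> None:
    """Fire increasingly spiky bursts and report latency percentiles.
    A sudden bend in p99 while the median stays flat is the ceiling talking."""
    for burst in burst_sizes:
        with ThreadPoolExecutor(max_workers=burst) as pool:
            latencies = sorted(pool.map(timed_get, [TARGET] * burst))
        p50 = statistics.median(latencies)
        p99 = latencies[int(0.99 * (len(latencies) - 1))]
        print(f"burst={burst:4d}  p50={p50:.3f}s  p99={p99:.3f}s")

if __name__ == "__main__":
    probe()
```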
They stopped relying on abstractions and started watching behaviour. They shaped traffic. Spread load. Designed around small, disposable components. Assumed the platform wouldn’t warn them.
And they were right.
The systems that survive aren’t the ones with the best dashboards. They’re the ones designed by people who expect to be surprised—and build like it.
Final Thought – Cloud Is a Contract, Not a Crystal Ball
You don’t own the cloud. You lease performance. You lease capacity. You lease confidence.
But leases come with clauses. And most of them are invisible.
The cloud is a black box with predictable patterns—but only if you go looking for them. Don’t expect truth. Expect thresholds.
If you want resilience, stop trusting the SLA and start testing the system.
Because you’re not building on metal. You’re building on someone else’s assumptions.
Everyone Has a Ceiling
This piece was inspired by Pinterest Engineering’s post on handling network throttling. Their team’s willingness to share the details made this story—and the broader discussion around silent cloud failure—possible.
Thanks to Tom Watson for surfacing the Microsoft Research work on gray failure that informed this post, and to Marcus Estrin for thoughtful feedback that shaped both the structure and clarity of the narrative.
Pinterest’s story is rare only because it was shared so openly. Most teams have their version—same symptoms, same silence, same scramble. You chalk it up to ‘weird AWS stuff’ and move on.
But these aren’t anomalies. They’re structural. They’re part of the system behaving exactly as it was built to—just not as you assumed.
If you build in the cloud, you’ll hit a ceiling. Maybe not today. Maybe not loudly. But it’s there.
Stay curious. Expect nothing for free. And never confuse silence for stability.
Because the cloud doesn’t shout. It whispers.
Further Reading
If you want to explore more examples of silent failure and hidden limits in cloud infrastructure—and the theory behind why they’re so hard to detect:
- AWS EBS Outage (2011) – The original post-mortem that exposed silent volume degradation.
- Kubernetes CPU Throttling – Why containers slow down even when CPU metrics look fine.
- Cloud Network Throttling (Clockwork) – How undetected bandwidth ceilings sabotage performance.
- Oracle Cloud NAT Gateway Throttling – OCI’s approach to shared resource limits and isolation.
- Gray Failure: The Achilles’ Heel of Cloud-Scale Systems (Microsoft Research) – A foundational study on why partial, hard-to-detect failures are the most dangerous kind.