Handling Surges in Petabyte-Scale Streaming Systems by Doing Nothing
Breakout Session
When streaming data at petabyte scale, one of the most painful on-call scenarios is handling sudden traffic surges that overload servers, trigger cascading failures, and wipe out service availability across a large blast radius. At modern throughput levels, scaling operations are simply not fast enough to prevent unexpected 10–20x spikes from taking down dozens of streaming pipelines and their neighbors. The classical mitigation is to overprovision replicas and headroom, add proactive alerting, and hope to “react quickly.”
In this talk, we present a TCP-based flow control approach that tackles the problem at its root and eliminates the need for manual on-call intervention.
At Pinterest, we have productionized this TCP-based flow control solution in a 50 GB/s streaming system that powers machine learning across the company. With the appropriate end-to-end flow control mechanisms in place, the system guards against sudden surges of any magnitude by propagating backpressure gracefully, predictably, and fully autonomously.
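To make backpressure concrete, here is a minimal Python sketch of the general principle. It is not Pinterest's implementation. It places a bounded buffer between a fast producer and a slow consumer, which is analogous in spirit to how a TCP receiver's advertised window throttles its sender. Because `queue.Queue(maxsize=...)` blocks `put()` while the buffer is full, a surge stalls the upstream stage instead of exhausting memory. The names `run_pipeline` and `BUFFER_CAPACITY`, the capacity of 8, and the simulated consumer delay are all illustrative assumptions.

```python
import queue
import threading
import time

# Hypothetical capacity: the most records this stage may hold in memory.
BUFFER_CAPACITY = 8


def run_pipeline(num_records: int, consumer_delay_s: float = 0.001) -> dict:
    """Push a burst of records through one bounded buffer into a slow consumer.

    Illustrative sketch only (hypothetical helper, not a production design).
    queue.Queue(maxsize=...) makes put() block while the buffer is full, so
    the producer is throttled (backpressure) instead of buffering without limit.
    """
    buffer: queue.Queue = queue.Queue(maxsize=BUFFER_CAPACITY)
    stats = {"consumed": 0, "max_depth": 0}
    done = object()  # sentinel that marks the end of the stream

    def producer() -> None:
        for i in range(num_records):
            buffer.put(i)  # blocks whenever the buffer is full
            stats["max_depth"] = max(stats["max_depth"], buffer.qsize())
        buffer.put(done)

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is done:
                break
            time.sleep(consumer_delay_s)  # simulate a slower downstream stage
            stats["consumed"] += 1

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


if __name__ == "__main__":
    # A burst far larger than the buffer: depth stays bounded, nothing is dropped.
    print(run_pipeline(200))
```

Because the producer simply waits, buffer depth never exceeds its capacity regardless of burst size. The cost of a surge appears as added latency upstream rather than as memory exhaustion or an outage.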
We will walk through the key concepts in networking, memory management, and backpressure that matter in large-scale streaming systems, and then unpack the exact mechanism we built to solve this problem. Attendees will leave with production-ready ideas and patterns that they can replicate in their own streaming environments, at far lower cost and operational overhead than classical solutions.
Beyond eliminating the catastrophic risk of sudden traffic surges, we will share concrete and replicable takeaways from running these concepts in production at scale, including:
Designing streaming topologies that rely on backpressure instead of excess capacity
Safely transforming scaling and load balancing into reactive operations, reducing unnecessary early alerting and interventions
Simplifying capacity planning for organic growth
Lowering infrastructure cost by running denser workloads with minimal buffer headroom
Jeff Xiang