Breakout Session

Building a Scalable Flink Platform: A Tale of 15,000 Jobs at Netflix

At Netflix, we operate more than 15,000 Apache Flink jobs that together process over 60 PB of data per day. Flink is widely used across many engineering organizations within Netflix to power stream processing use cases such as data movement, personalization, messaging, and finance. This diverse usage shows up in widely varying job characteristics, such as cluster size (parallelism of up to 2,000), state size (from stateless to 4 TB), and job graph complexity.

In this talk, we will share our journey in building a scalable Flink platform. We will highlight the components and systems we built to onboard new use cases quickly and to help users manage their Flink jobs effectively. We will also share how we addressed the challenges we encountered while operating Flink clusters at scale.

Mark Cho

Netflix

Mingliang Liu

Netflix