Scaling Streaming Computation at LinkedIn: A Multi-Year Journey with Apache Flink

Breakout Session

At LinkedIn, stream processing is the foundation for delivering real-time features, metrics, and member experiences across products like Ads AI, Search, Notifications, and Premium. Over the past four years, we’ve built and evolved a fully managed stream processing platform based on Apache Flink to meet increasing demands for scale, state, and reliability.

This talk shares our journey from building a self-serve, Kubernetes-native Flink platform to supporting high-throughput, stateful applications with managed Flink SQL. Today, our platform powers thousands of mission-critical pipelines and enables developers to author and deploy jobs declaratively, while abstracting away operational complexity.

As workloads grew in complexity and state size, we tackled state management challenges head-on: optimizing checkpointing and recovery, evaluating state storage options, and navigating trade-offs in scalability, cost, and performance. We’ll walk through how we scaled stateful joins, onboarded high-QPS applications, and migrated from Samza and Couchbase to Flink SQL, achieving over 80% hardware cost savings.

Key highlights include:

- Building a self-serve Flink platform on Kubernetes with split deployment, monitoring, alerting, auto-scaling, and failure recovery

- Scaling Flink SQL: challenges and lessons from supporting large stateful jobs, including state storage choices, state garbage collection (GC) failures, and inefficient job sizing

- Diagnosing performance bottlenecks and building a resource estimation model for join-intensive Flink SQL pipelines

- Developing tooling for safe migrations, automating reconciliation and backfill workflows, and enabling end-to-end validation

We’ll share the lessons learned and platform investments that helped us scale Apache Flink from early experimentation to a robust, production-grade streaming engine. Whether you’re building a Flink-based platform or migrating stateful pipelines at scale, this talk offers actionable insights from operating Flink in production.

Weiqing Yang