Flink Beyond Streaming: Building a Production-Ready Batch Platform at LinkedIn

Breakout Session

Apache Flink is widely known for streaming, but running Flink Batch as a reliable, repeatable “default” engine for critical offline workloads requires platform work that does not show up in typical examples. In this session, we will share how we productionized Flink Batch and Flink SQL for large batch pipelines, covering the engineering choices, operational guardrails, and lessons learned when scaling adoption at LinkedIn.

We will start with the platform foundations needed to make batch SQL dependable in production: packaging and deployment patterns for batch SQL jobs, reducing configuration drift between job logic and orchestration, and the minimum observability you need to debug regressions quickly. Then we will go deep on concrete performance and scalability work:

SQL query optimizations such as nested projection and filter pushdown to reduce compute and I/O.

Remote shuffle with Celeborn to overcome shuffle bottlenecks and improve throughput predictability for the largest batch workloads.

Workflow orchestration with Apache Airflow to schedule, monitor, and recover batch pipelines with minimal operator toil.

Operational observability using Flink HistoryServer for post-job diagnostics and faster root-cause analysis.

To make it real, the talk is anchored by two production “tales from the trenches”:

Scalability and Reliability

Training Data at Scale: Lessons from LinkedIn Ads

We will walk through how we optimized a large machine learning model training data pipeline running on Flink Batch, including changes in SQL planning, execution, and shuffle architecture, and how these improvements enhanced runtime performance and operational stability. We will also share before-and-after numbers to show the scaling improvements.

Developer Experience and Maintainability

Scaling Central Interaction Logging ingestion (online + offline)

Central Interaction Logging (CIL) is a central platform that provides a unified view of users' interactions across online and offline environments, so downstream systems, including AI models, can rely on a single consistent source. We will share the bottlenecks encountered when scaling onboarding to many near-identical SQL ingestion jobs: manual job/DAG scaffolding, fragile configuration wiring, schema-only testing, and recurring Avro/schema maintenance.

Audience takeaways: a practical checklist for running Flink Batch at scale (query tuning, shuffle choices, orchestration, and observability), and patterns for onboarding many SQL jobs with less duplication, better testability, and safer schema/dependency evolution.

Archit Goyal