Dynamic Kafka, Static Sleep: Taming Multi-Cluster Streams with Flink at OpenAI

Breakout Session

At OpenAI, Kafka streams don’t sit still: a single logical “stream” can span multiple clusters and sometimes multiple regions, and the underlying topology changes as we migrate, scale, or fail over. That’s great for availability—but it’s a sharp edge for stream processors that assume “one cluster, stable topics, one offset story.” (Spoiler: that assumption dies first.)

This talk shares our journey at OpenAI to make Apache Flink’s DynamicKafkaSource fit that reality, using StreamLink, our Kafka-to-warehouse ingestion system built on Flink, as a case study. We’ll walk through the mental model shift from “topics on a cluster” to “a stream over an ever-changing infrastructure topology,” what worked, and where we ran into the most interesting edge cases: offsets, state, and operational safety when the Kafka topology evolves underneath a running Flink job.

Rather than presenting a polished fairy tale where every checkpoint is happy and every offset is deterministic, we’ll focus on the decisions and tradeoffs: the approaches we considered, the guardrails we’re putting in place, what we’re validating, and the questions we think the community should care about as dynamic consumption of Kafka becomes more popular. We’ll also cover what we’re contributing back to OSS across core implementations and APIs (Java/Python/Table/SQL), and a practical roadmap.

You’ll leave with patterns you can apply to multi-cluster Kafka + Flink deployments, a checklist of “gotchas” to watch for, and a few ideas you can steal. After all, if Kafka is going to be dynamic, your consumption strategy should be too (preferably without becoming dynamically on-call).

Bowen Li

OpenAI

Xin Gao

OpenAI