StreamLink: Real-Time Data Ingestion at OpenAI Scale

Breakout Session

In the modern data lakehouse, real-time ingestion isn’t just a nice-to-have – it’s a foundational capability. Model training and evaluation, human analysts, and autonomous AI agents all demand fresh, trustworthy data from diverse sources at massive scale. These expectations are a challenge for platform teams – but they’re also an opportunity to unlock significant business value.

At OpenAI, we built StreamLink, a real-time streaming ingestion platform for the data lakehouse, powered by Apache Flink. StreamLink ingests 100+ GiB/s of data from Kafka into Delta Lake and Iceberg tables, supporting 2,000 datasets across 20+ partner teams.

In this session, we’ll dive deep into the design and implementation of StreamLink. We’ll explore our Kubernetes-native deployment model (built on the Flink Kubernetes Operator), adaptive autoscaling heuristics, and self-service onboarding – all of which keep platform operations lean. Attendees will take away concrete patterns for building scalable, manageable real-time ingestion systems in their own data lakehouses.


Adam Richardson

OpenAI