Breakout Session

Bridging the Kafka/Iceberg Divide


Apache Kafka® has long been the de facto standard for managing and distributing data in motion, and Apache Iceberg has been on the rise in a similar role for data at rest. A common pattern is to collect streaming data from Kafka into Iceberg so that it can be efficiently queried. However, this process can be complex, expensive, and error-prone. It entails a sequence of connectors, streaming jobs, and batch jobs, and each step in the resulting pipeline must be manually configured, managed, and scaled. It also produces multiple copies of the data with no single source of truth. As the scale grows, so do complexity, cost, and toil.

We think that most of this complexity is avoidable. Data in motion and data at rest are just two sides of the same coin. At Confluent, we are flipping the problem on its head: we establish an equivalence between Topics and Tables and build a system that transparently maps Topics into Tables and vice versa. This talk covers the practical application of this philosophy in Confluent's data architecture and explains how you can apply the approach to your own data pipelines. We also cover some of the deeper technical details, such as how to map schemas and schema evolution from Kafka's domain into Iceberg, and how we built a scalable and manageable system to continuously materialize hundreds of thousands of topics into Iceberg tables.
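To give a flavor of what mapping schemas and schema evolution can involve, here is a minimal sketch in Python. It is not Confluent's implementation; it assumes topic values are described by flat Avro record schemas with primitive, non-nullable fields, and the function names are hypothetical. The type pairs and the two promotions shown (int to long, float to double) follow the Iceberg table specification.

```python
# Illustrative sketch only; names and scope are assumptions, not Confluent's system.

# Avro primitive type -> Iceberg primitive type (per the Iceberg spec's Avro mapping).
AVRO_TO_ICEBERG = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "bytes": "binary",
}

# Primitive type promotions that Iceberg schema evolution permits.
ICEBERG_PROMOTIONS = {("int", "long"), ("float", "double")}


def avro_record_to_iceberg(avro_schema: dict) -> dict:
    """Map a flat Avro record schema to {column name: Iceberg type}.

    Unions, logical types, and nested records are deliberately out of scope.
    """
    columns = {}
    for field in avro_schema["fields"]:
        avro_type = field["type"]
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_ICEBERG:
            raise ValueError(f"unsupported Avro type for field {field['name']!r}")
        columns[field["name"]] = AVRO_TO_ICEBERG[avro_type]
    return columns


def is_compatible_evolution(old: dict, new: dict) -> bool:
    """Check that each surviving column keeps its type or is validly promoted.

    Added and dropped columns are accepted, since Iceberg allows both.
    """
    for name, old_type in old.items():
        if name not in new:
            continue
        new_type = new[name]
        if new_type != old_type and (old_type, new_type) not in ICEBERG_PROMOTIONS:
            return False
    return True


order_v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "int"},
    {"name": "amount", "type": "float"},
]}
order_v2 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "double"},
    {"name": "note", "type": "string"},
]}

table_v1 = avro_record_to_iceberg(order_v1)
table_v2 = avro_record_to_iceberg(order_v2)
```

In this sketch, moving from `order_v1` to `order_v2` widens two columns and adds one, which the check accepts; the reverse direction would narrow `long` back to `int` and is rejected.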

This talk has something for everyone! Those who are curious about Iceberg and want to know how to integrate it with their Kafka infrastructure will get a good introduction, and those who spend too much time debugging complex Iceberg materialization pipelines will get some tips for simplifying their systems.

John Roesler

Confluent

Vasiliki Papavasileiou

Confluent