Background

Breakout Session

Highly Reliable Streaming Ingestion from Kafka to Iceberg Tables with Seamless Schema Evolution

Streaming data ingestion often has to deal with evolving data formats. The schema of incoming records can change over time as producers add new attributes. Even when schemas evolve or records are invalid, stream processing should continue without interruption. Graceful handling of schema changes and data errors is critical for zero downtime in modern real-time data platforms.

Apache Iceberg® is an open table format that provides flexible metadata management along with ACID transactions, time travel, and rollback. This makes Iceberg an ideal choice for enabling seamless schema evolution and strong reliability in streaming workloads.
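As a brief, hedged illustration of two of these capabilities, the sketch below shows time travel and snapshot rollback from Spark SQL against a hypothetical Iceberg table. The catalog name glue_catalog, the table db.events, and the snapshot ID are placeholders, not part of the session material, and the CALL procedure assumes Iceberg's Spark SQL extensions are configured.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-capabilities").getOrCreate()

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT * FROM glue_catalog.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Rollback: restore the table to a previous snapshot via Iceberg's stored procedure.
# The snapshot ID below is a placeholder.
spark.sql("CALL glue_catalog.system.rollback_to_snapshot('db.events', 1234567890)")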

In this session, we will leverage Iceberg capabilities for streaming ingestion from Kafka into data lakes with seamless schema evolution. We will also discuss reliability considerations such as atomic commits and invalid record isolation that make this architecture resilient and highly available for streaming use cases, and we will showcase core Iceberg capabilities such as atomic writes and schema evolution along the way.

Next, we'll explore the end-to-end streaming ingestion flow from Kafka to Iceberg tables via Spark Structured Streaming. The flow also includes common transformations to clean up and enrich the incoming records. During ingestion, Spark validates records and routes invalid ones to dedicated error files, ensuring that only clean, validated data lands in Iceberg tables without restarting the stream on errors.
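A minimal sketch of the flow just described, assuming JSON records on a Kafka topic and an existing Iceberg table. The broker address, topic, table name, S3 paths, record schema, and validation rule are all illustrative assumptions, not the session's actual code.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (SparkSession.builder
         .appName("kafka-to-iceberg-ingestion")
         .getOrCreate())

# Expected schema of the incoming JSON records (assumed for this sketch).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", IntegerType()),
])

# Read the raw stream from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Parse the payload and apply a simple enrichment (clean-up logic would go here).
parsed = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("data"))
          .select("data.*")
          .withColumn("ingested_at", current_timestamp()))

def write_batch(batch_df, batch_id):
    # Validation rule (assumed): records need an event_id and a non-negative amount.
    is_valid = (col("event_id").isNotNull()
                & col("amount").isNotNull()
                & (col("amount") >= 0))
    valid = batch_df.filter(is_valid)
    invalid = batch_df.filter(~is_valid)

    # Valid records are appended to the Iceberg table; each micro-batch is one atomic commit.
    valid.writeTo("glue_catalog.db.events").append()

    # Invalid records are isolated into error files instead of failing the stream.
    invalid.write.mode("append").json("s3://example-bucket/errors/events/")

query = (parsed.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
         .start())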

We'll also demonstrate seamless schema evolution capabilities in Iceberg. The table schema can be altered over time, for example by adding new columns or widening column types, without downtime. Iceberg manages this by creating new schema versions and evolving table metadata at commit time while keeping existing data compatible, so old and new data files with their respective schemas coexist in the same table.
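A hedged sketch of what that schema evolution could look like from Spark SQL, reusing the hypothetical glue_catalog.db.events table from the earlier sketch; the column names are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

# Add a new column; existing data files stay untouched and read back as NULL for it.
spark.sql("ALTER TABLE glue_catalog.db.events ADD COLUMN currency string")

# Widen a column type (int -> bigint is a supported promotion). Iceberg records a new
# schema version in the table metadata at commit time, so data files written under
# the old schema and the new one remain readable side by side.
spark.sql("ALTER TABLE glue_catalog.db.events ALTER COLUMN amount TYPE bigint")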

Noritaka Sekiyama

Amazon Web Services