Streaming CDC to Apache Iceberg at Scale with Apache Kafka: Best Practices for Enterprise Lakehouse Architectures

Breakout Session

Streaming change data capture (CDC) events from operational databases into analytical platforms has become a critical capability for data-driven enterprises. This session explores the architectural patterns and operational best practices for building robust, scalable CDC pipelines that deliver data to Apache Iceberg using Apache Kafka as the streaming backbone.

As organizations increasingly adopt lakehouse architectures for their analytical workloads, the challenge shifts from simply moving data to moving it efficiently and at scale. This session provides practical guidance on setting up end-to-end CDC streaming pipelines, covering key considerations such as schema evolution, partitioning strategy, compaction policies, and write performance tuning specific to Iceberg tables.

Attendees will learn proven techniques for managing high-volume CDC streams, including strategies for handling late-arriving data, mitigating the small-file problem, optimizing merge operations, and implementing effective monitoring and alerting. We'll also discuss the operational measures needed to scale these pipelines to enterprise workloads, including resource allocation, backpressure management, and data consistency across distributed systems.

By attending this session, your contact information may be shared with the sponsor for relevant follow-up related to this event only.

Vinayaka Gangadhar

AWS

Yashika Jain

AWS