Background

Breakout Session

CDC Pipelines to the Data Lakehouse Using Apache Hudi and Apache Flink

Change Data Capture (CDC) is crucial for capturing and disseminating real-time changes from upstream OLTP databases. Data lakehouses provide scalability for data exploration on open data formats, but many struggle with the record-level updates that CDC workloads demand. Apache Hudi® has emerged as a comprehensive lakehouse platform purpose-built for managing mutable data, facilitating near real-time insertions, updates, and deletes with a robust incremental processing framework. Hudi's integration with Apache Flink® bridges the gap between the streaming and data lakehouse communities, enabling low-latency, real-time data applications for use cases such as change log reconciliation, ETL pipelines, streaming joins, and constructing materialized views.
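
To ground the integration, below is a minimal sketch of such a pipeline using Flink's Table API in Java. The schema, connection settings, and storage path are illustrative placeholders, and the sketch assumes the Flink CDC MySQL connector and the Hudi Flink bundle are on the job's classpath:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CdcToHudi {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.inStreamingMode());

            // CDC source: the Flink CDC connector turns the upstream
            // database's binlog into a changelog stream of inserts,
            // updates, and deletes. All connection values are placeholders.
            tEnv.executeSql(
                "CREATE TABLE orders_cdc (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  updated_at TIMESTAMP(3)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'localhost'," +
                "  'port' = '3306'," +
                "  'username' = 'cdc_user'," +
                "  'password' = '***'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders'" +
                ")");

            // Hudi sink: a Merge-on-Read table keyed on the same primary
            // key, so the changelog is applied as record-level upserts
            // and deletes on the lakehouse table.
            tEnv.executeSql(
                "CREATE TABLE orders_hudi (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  updated_at TIMESTAMP(3)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 's3://lake/orders_hudi'," +
                "  'table.type' = 'MERGE_ON_READ'" +
                ")");

            // The continuous pipeline itself is a single statement.
            tEnv.executeSql("INSERT INTO orders_hudi SELECT * FROM orders_cdc");
        }
    }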

This talk unveils a powerful one-liner for unleashing CDC with Hudi and Flink, along with a detailed dissection of its underlying implementation. We showcase how Hudi's powerful runtime applies upstream database changes through fast upsert operations while self-managing tables on the lakehouse. The talk highlights Hudi's novel approach to concurrency control for high-volume streaming multi-writers: Non-Blocking Concurrency Control (NBCC). NBCC overcomes the shortcomings of alternative technologies that support only optimistic concurrency control, which burdens streaming systems with retries and wasted computing resources.
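
The one-liner is, in spirit, an INSERT INTO statement like the one in the previous sketch. The sketch below instead shows how NBCC might be enabled on the sink table. The option names follow Hudi's Flink configuration as of Hudi 1.x, where NBCC requires a Merge-on-Read table with a bucket index; the schema and path remain placeholders:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class NbccSink {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.inStreamingMode());

            // NBCC lets multiple streaming writers commit to the same
            // table without optimistic-lock retries. The bucket index
            // maps each record key to the same file group for every
            // writer, and conflicting writes are reconciled during
            // merge/compaction rather than by failing a writer.
            tEnv.executeSql(
                "CREATE TABLE orders_hudi (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 's3://lake/orders_hudi'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'index.type' = 'BUCKET'," +
                "  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'" +
                ")");
        }
    }

Two independent Flink jobs declaring this same table can then each run their own INSERT INTO against it concurrently.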

Attendees will learn how Hudi unlocks use cases such as real-time multi-dataset joins and highly concurrent streaming workloads, including change log ingestion. The talk also provides a broad overview of the current challenges of running production-grade CDC pipelines into a lakehouse, and shows how Hudi further enables incremental processing and change capture downstream in the data lakehouse.
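
For the downstream side, here is a hedged sketch of incremental consumption: a Flink streaming read that tails new commits on the Hudi table instead of rescanning it. The option names follow the Hudi Flink connector, and the schema and path are again placeholders:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class IncrementalRead {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.inStreamingMode());

            // Streaming (incremental) read: rather than scanning the
            // whole table, the connector tails commits from the given
            // start instant, feeding downstream ETL, streaming joins,
            // or materialized views.
            tEnv.executeSql(
                "CREATE TABLE orders_changes (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 's3://lake/orders_hudi'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'read.streaming.enabled' = 'true'," +
                "  'read.start-commit' = 'earliest'" +
                ")");

            // Consume the change stream, e.g. print it or join it
            // with another table in the same job.
            tEnv.executeSql("SELECT * FROM orders_changes").print();
        }
    }

Because the read starts from a commit instant rather than a full snapshot, the same mechanism that ingests CDC upstream propagates change capture to downstream pipelines.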

Shiyan Xu

Onehouse

Nadine Farah

Onehouse