Breakout Session

Addressing Streaming ETL Pipelines Challenges: Delving into Flink CDC


Data freshness significantly affects the value of data insights, particularly for business data stored in databases. A near-real-time synchronization pipeline is crucial for deriving actionable insights from this data, but building one raises several challenges. Upstream table schemas change frequently, and tables must be added or removed as business requirements evolve. Resources are hard to scale flexibly between the historical reading and log reading stages, and synchronizing many tables or an entire database consumes substantial resources. Building job workflows with the DataStream API adds complexity of its own.

During this session, I will delve into the fundamental design and implementation of Flink CDC and explore how it addresses these challenges. Flink CDC is an end-to-end streaming ETL framework built on Flink for constructing stable, real-time streaming pipelines. Its capabilities include automatically recycling idle resources for scalability, automatically synchronizing upstream DDL/DML changes to downstream systems, and dynamically modifying the set of captured tables. I will also demonstrate how Flink CDC offers a simple, user-friendly development experience through YAML pipeline definitions.
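
For readers unfamiliar with the YAML approach, the following is a minimal sketch of what a Flink CDC pipeline definition can look like, assuming a MySQL source and an Apache Doris sink. The hostnames, credentials, database name, and pipeline name are placeholders, not values from the session, and the option names follow the Flink CDC 3.x pipeline connector documentation as best recalled, so they should be checked against the version in use.

```yaml
# Hypothetical example: sync every table in one MySQL database to Doris.
# All hosts, ports, and credentials below are placeholders.
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: cdc_user
  password: "change-me"
  # Regular expression selecting the captured tables; here, all tables in app_db.
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: doris_user
  password: "change-me"

pipeline:
  name: Sync app_db to Doris
  parallelism: 2
```

In this style, the whole-database synchronization job is described declaratively in one file, and widening or narrowing the `tables` pattern changes which tables are captured without writing DataStream API code.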

Leonard Xu

Alibaba Cloud
