An ounce of prevention is worth a pound of cure - Fix data clustering in streaming write to Iceberg

Breakout Session

Apache Flink is commonly used to ingest continuous streams of data to Apache Iceberg tables. But it lacks the ability to organize data at write time, which can lead to small files and poor data clustering problems for many use cases. Regular table maintenance, such as compaction and sorting, can help to remediate the problems. But prevention is usually cheaper than remediation.

In this talk, we will present a solution that can prevent those problems during streaming ingestion. Range distribution (sorting) is a common technique for data clustering, and many batch engines support it when writing to Iceberg. We will describe the range partitioner that was contributed to Flink Iceberg sink (released in Iceberg 1.7 from late 2024). We will deep dive into how to handle the challenges of unbounded streams, organically evolving traffic patterns, low-cardinality and high-cardinality sort columns, and rescaling writer parallelism.

By the end of the session, you will understand the design choices and tradeoffs, and why it is applicable to broad use cases of streaming ingestion.


Steven Wu

Snowflake