Breakout Session

Enabling Flink's Cloud-Native Future: Introducing Disaggregated State in Flink 2.0

< All 2024 Sessions

In this presentation, I will introduce disaggregated state, a novel architecture we propose in Flink 2.0. It leverages Distributed File Systems (DFS) as the primary storage solution to address critical challenges faced by Flink in the cloud-native era, including:

● Limited local disk capabilities in containerized environments.

● Inefficient resource utilization due to compaction in the existing state model.

● Scalability bottlenecks encountered with large state sizes

● Time-consuming and cumbersome checkpointing processes.

However, simply extending the state store to read/write from remote DFS is insufficient due to the existing blocking execution model in Flink. PoC results show that directly reading/writing to DFS can be 95% slower than local disks.

Presentation Highlights:

● Overcoming the Blocking Execution Model: Explore solutions for asynchronous execution model and APIs to facilitate non-blocking execution. Addressing challenges like preserving processing order for identical keys, checkpointing synchronization, and handling disordered watermarks and timers.

● Grouping Remote State Access: Enable efficient retrieval of remote state data in batches, minimizing the impact of round-trip times associated with RPC calls.

● Efficient Fault Tolerance and Fast Rescaling: integrate the checkpointing mechanisms with the disaggregated state store for efficient fault tolerance and fast rescaling.

● Flink 2.0 Release: The first beta version of disaggregated state is integrated with asynchronous SQL operators.

Overall, this presentation aims to provide a comprehensive overview of disaggregated state storage we propose in Flink 2.0, empowering Flink to continue as a leading stateful stream processing engine in the cloud-native landscape. Please refer to FLIPs (FLIP-423 to FLIP-428) we've published for further details.

Yuan Mei

Alibaba

Download