Breakout Session
In this presentation, I will introduce disaggregated state, a novel architecture we propose in Flink 2.0. It leverages Distributed File Systems (DFS) as the primary storage solution to address critical challenges faced by Flink in the cloud-native era, including:
● Limited local disk capabilities in containerized environments.
● Inefficient resource utilization due to compaction in the existing state model.
● Scalability bottlenecks encountered with large state sizes
● Time-consuming and cumbersome checkpointing processes.
However, simply extending the state store to read/write from remote DFS is insufficient due to the existing blocking execution model in Flink. PoC results show that directly reading/writing to DFS can be 95% slower than local disks.
Presentation Highlights:
● Overcoming the Blocking Execution Model: Explore solutions for asynchronous execution model and APIs to facilitate non-blocking execution. Addressing challenges like preserving processing order for identical keys, checkpointing synchronization, and handling disordered watermarks and timers.
● Grouping Remote State Access: Enable efficient retrieval of remote state data in batches, minimizing the impact of round-trip times associated with RPC calls.
● Efficient Fault Tolerance and Fast Rescaling: integrate the checkpointing mechanisms with the disaggregated state store for efficient fault tolerance and fast rescaling.
● Flink 2.0 Release: The first beta version of disaggregated state is integrated with asynchronous SQL operators.
Overall, this presentation aims to provide a comprehensive overview of disaggregated state storage we propose in Flink 2.0, empowering Flink to continue as a leading stateful stream processing engine in the cloud-native landscape. Please refer to FLIPs (FLIP-423 to FLIP-428) we've published for further details.