Background

Breakout Session

Enabling Flink's Cloud-Native Future: Introducing Disaggregated State in Flink 2.0

In this presentation, I will introduce disaggregated state, a novel architecture we propose in Flink 2.0. It leverages Distributed File Systems (DFS) as the primary storage solution to address critical challenges faced by Flink in the cloud-native era, including:  

● Limited local disk capabilities in containerized environments.  

● Inefficient resource utilization due to compaction in the existing state model.  

● Scalability bottlenecks encountered with large state sizes  

● Time-consuming and cumbersome checkpointing processes.  

  However, simply extending the state store to read/write from remote DFS is insufficient due to the existing blocking execution model in Flink. PoC results show that directly reading/writing to DFS can be 95% slower than local disks.  

  Presentation Highlights:  

● Overcoming the Blocking Execution Model: Explore solutions for asynchronous execution model and APIs to facilitate non-blocking execution. Addressing challenges like preserving processing order for identical keys, checkpointing synchronization, and handling disordered watermarks and timers.  

● Grouping Remote State Access: Enable efficient retrieval of remote state data in batches, minimizing the impact of round-trip times associated with RPC calls.  

● Efficient Fault Tolerance and Fast Rescaling: integrate the checkpointing mechanisms with the disaggregated state store for efficient fault tolerance and fast rescaling.  

● Flink 2.0 Release: The first beta version of disaggregated state is integrated with asynchronous SQL operators.  

  Overall, this presentation aims to provide a comprehensive overview of disaggregated state storage we propose in Flink 2.0, empowering Flink to continue as a leading stateful stream processing engine in the cloud-native landscape. Please refer to FLIPs (FLIP-423 to FLIP-428) we've published for further details.

Yuan Mei

Alibaba