Joining Streams Through Time: As-Of Joins with Spark 4's new transformWithState API

Breakout Session

Spark Structured Streaming offers powerful built-in operators for stream-stream joins, but what happens when you need to join two streams based on temporal proximity rather than exact key matches? As-of temporal joins—where each event from one stream matches the most recent corresponding record from another—are essential for scenarios such as currency conversion at transaction time, enriching trades with the latest quotes, or correlating sensor readings with their nearest calibration values.

Apache Spark 4.0 introduces transformWithState, a next-generation stateful processing operator that replaces the limitations of flatMapGroupsWithState. This talk demonstrates how to leverage transformWithState to build production-grade as-of temporal joins between two event streams.

We'll explore the core building blocks: using MapState to maintain versioned lookup data keyed by time, processing incoming events to find the nearest temporal match, and managing state lifecycle with TTL-based eviction to prevent unbounded growth. You'll see how the new object-oriented StatefulProcessor model separates concerns cleanly—handling input events in handleInputRows and expired state in handleExpiredTimer—making complex temporal logic more maintainable than ever before.

Through live code examples in both Scala and Python (transformWithStateInPandas), attendees will learn practical patterns for buffering late-arriving reference data, handling out-of-order events using watermarks and timers, and emitting joined results only when temporal alignment is confident. We'll also demonstrate how to use Spark 4's new state data source reader to debug and monitor the internal state of your temporal join during development.

Key takeaways:

When and why to implement custom temporal joins instead of using built-in stream-stream joinsStep-by-step implementation of as-of joins using transformWithState composite state typesState management strategies, including TTL, timers, and watermark integrationDebugging techniques using the state data source readerWhether you're building financial trading systems, IoT analytics pipelines, or real-time ML feature stores, this talk equips you with patterns to solve temporal alignment challenges in Spark Structured Streaming.


Carlos Rodrigues

Databricks