Beyond Watermarks: Custom Flink Operators for Feature-Trigger Synchronization
Breakout Session
To catch malicious behavior on Europe's largest second-hand marketplace, every second matters - but so does accuracy, which relies on data completeness.
At Vinted, within the Trust domain, we process millions of events daily to enable ML models and rule engines to detect malicious users and policy-violating content in real time. Using an event-driven architecture, we rely on events to trigger rule evaluations. These evaluations also need other computed features to be available in time for accurate detection results. If the features lag behind, models operate on stale or even missing data.
In this talk, I will first elaborate on why built-in event-time processing operators in Flink didn’t meet the requirements to guarantee feature-trigger alignment. For starters, Flink only has (temporal) inner joins out of the box. Then, I will share the custom solutions we built. In particular, you will learn two distinct patterns, suitable for different data velocities and scenarios.
Pattern 1: Feature Store and Ingestion Notification Alignment
We have several complex features computed by dedicated pipelines that write to our Feature Store. Here, we rely on ingestion notifications to signal feature availability. We buffer triggers until all required feature groups report readiness or a timeout is reached. The downstream service therefore evaluates rules only once the trigger is released, at which point the necessary features are known to be available or the wait has timed out.
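The buffering logic of this pattern can be illustrated with a minimal, framework-free sketch. It is not the operator presented in the talk: the class and method names (`TriggerAligner`, `on_trigger`, `on_notification`, `on_timer`) are hypothetical, and it assumes that each ingestion notification carries a timestamp up to which a feature group has been ingested for a given key. In Flink, comparable logic would typically live in a `KeyedProcessFunction` using keyed state and timers.

```python
from dataclasses import dataclass, field


@dataclass
class PendingTrigger:
    trigger_id: str
    ts_ms: int            # event time of the trigger
    deadline_ms: int      # time at which we stop waiting for features
    missing: set = field(default_factory=set)


class TriggerAligner:
    """Buffers triggers per key until every required feature group is ready."""

    def __init__(self, required_groups, timeout_ms):
        self.required_groups = frozenset(required_groups)
        self.timeout_ms = timeout_ms
        self.pending = {}  # key -> list of PendingTrigger
        self.ready = {}    # key -> {group: latest ingested-up-to timestamp}

    def on_trigger(self, key, trigger_id, ts_ms):
        seen = self.ready.get(key, {})
        # Assumption: a group covers a trigger if it was ingested up to the trigger time.
        missing = {g for g in self.required_groups if seen.get(g, -1) < ts_ms}
        if not missing:
            return [(trigger_id, "complete")]
        self.pending.setdefault(key, []).append(
            PendingTrigger(trigger_id, ts_ms, ts_ms + self.timeout_ms, missing))
        return []

    def on_notification(self, key, group, ingested_up_to_ms):
        seen = self.ready.setdefault(key, {})
        seen[group] = max(seen.get(group, -1), ingested_up_to_ms)
        released, still_waiting = [], []
        for p in self.pending.get(key, []):
            if ingested_up_to_ms >= p.ts_ms:
                p.missing.discard(group)
            if p.missing:
                still_waiting.append(p)
            else:
                released.append((p.trigger_id, "complete"))
        self.pending[key] = still_waiting
        return released

    def on_timer(self, now_ms):
        # Release triggers whose deadline has passed, flagged as timed out.
        released = []
        for key, items in self.pending.items():
            expired = [p for p in items if p.deadline_ms <= now_ms]
            self.pending[key] = [p for p in items if p.deadline_ms > now_ms]
            released += [(p.trigger_id, "timed_out") for p in expired]
        return released
```

Tagging each released trigger as "complete" or "timed_out" is a design choice of this sketch; it lets a downstream consumer distinguish evaluations that ran on possibly incomplete features.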
Pattern 2: Direct Trigger Enrichment
When data directly relates to a trigger event - or latency is critical - it makes sense to attach it directly to triggers. The implementation splits into two cases depending on data velocity: slowly changing dimensions that require longer buffering, and rapidly updating streams that need only short alignment windows.
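As a rough illustration of direct enrichment, the sketch below keeps time-ordered updates per key and attaches to each trigger the latest update at or before the trigger's timestamp, provided it is not older than a configurable maximum age. The names (`DirectEnricher`, `max_age_ms`, the `"feature"` field) are hypothetical, and the two velocity cases are modeled simply as a long versus a short `max_age_ms`. Buffering a trigger until late updates have arrived is deliberately left out.

```python
import bisect
from collections import defaultdict


class DirectEnricher:
    """Attaches the most recent sufficiently fresh update to a trigger."""

    def __init__(self, max_age_ms):
        self.max_age_ms = max_age_ms
        self.times = defaultdict(list)   # key -> sorted update timestamps
        self.values = defaultdict(list)  # key -> values, parallel to times

    def on_update(self, key, ts_ms, value):
        # Insert in timestamp order so out-of-order updates are handled.
        i = bisect.bisect_right(self.times[key], ts_ms)
        self.times[key].insert(i, ts_ms)
        self.values[key].insert(i, value)

    def enrich(self, key, trigger_ts_ms, payload):
        times = self.times.get(key, [])
        # Index of the latest update at or before the trigger time, or -1.
        i = bisect.bisect_right(times, trigger_ts_ms) - 1
        fresh = i >= 0 and trigger_ts_ms - times[i] <= self.max_age_ms
        feature = self.values[key][i] if fresh else None
        return {**payload, "feature": feature}


# Slowly changing dimension: tolerate old values (here, up to one day).
slow = DirectEnricher(max_age_ms=86_400_000)
slow.on_update("u1", 1_000, "verified")
print(slow.enrich("u1", 50_000, {"id": "t1"}))

# Fast stream: only accept updates within a short alignment window.
fast = DirectEnricher(max_age_ms=500)
fast.on_update("u1", 1_000, 3)
fast.on_update("u1", 1_400, 4)
print(fast.enrich("u1", 1_450, {"id": "t2"}))
```

A real operator would also need to expire old updates from state; that cleanup is omitted here for brevity.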
Key Takeaways
After this talk, you’ll know when feature-trigger alignment is needed in streaming, the tradeoffs between aligning triggers with Feature Store ingestions or using direct enrichments, and the implementation details of our custom Flink operators. I’ll also share insights on how deployment strategies and certain outages affect enrichment correctness.
Csanád Bakos
Vinted