Beyond Documentation: AI Agents as Flink Debugging Partners
Breakout Session
Operating over 1,000 Apache Flink applications at Stripe has taught us that even the most comprehensive documentation can't eliminate the cognitive load of debugging complex distributed systems.
Less experienced Flink developers routinely juggle multiple tools (Flink UI, Prometheus metrics, Splunk logs) while cross-referencing extensive runbooks to diagnose failures. This operational overhead inspired us to explore an unconventional solution: integrating AI coding agents directly into our Flink platform.
In this talk, we'll share how we transformed Flink debugging from a multi-tool treasure hunt into an intelligent, conversational experience. Our integration enables AI agents to:
-- Automatically fetch and correlate metrics
-- Parse logs for relevant error patterns
-- Navigate our extensive Flink documentation and runbooks
-- Generate contextual debugging suggestions
We'll cover our implementation journey, quantitative improvements (x% faster diagnosis), and the critical human-in-the-loop patterns that ensure safety. You'll see real debugging sessions, learn how we chose the right model, and understand where the agents fall short. We'll conclude with actionable insights for teams considering AI-assisted operations.
Pratyush Sharma
Stripe
Seth Saperstein
Stripe