How Datadog Runs Its Streaming Platform
Breakout Session
Operating Kafka in production is hard. Operating thousands of Kafka clusters globally—without customer-visible incidents—is an entirely different problem.
At Datadog, Kafka is the backbone of our real-time data ingestion and streaming platform, processing petabytes of data every day across thousands of clusters and tens of thousands of brokers. At this scale, failures are rarely loud or localized. Instead, they surface as subtle latency shifts, uneven consumer lag, stalled rebalances, or slow partitions that quietly degrade customer experience if not caught early.
The hardest part is not detecting symptoms—it’s identifying root causes fast enough to prevent impact. Standard Kafka monitoring (JMX metrics, broker health, consumer lag) breaks down when incidents span multiple clusters, teams, and regions.
This talk explores how Datadog runs a massive Kafka fleet in production while minimizing incidents and customer impact, and the observability practices that make this possible. Through real production scenarios, we’ll show how we correlate signals across brokers, consumers, storage layers, and infrastructure to understand why something is wrong—not just that it is.
We’ll dive into the technical foundation behind this approach:
- Partition-level throughput and latency analysis to detect emerging hot spots
- Continuous profiling to identify GC and allocation issues before they affect tail latency
- Distributed tracing to follow slow produce and fetch paths across services and clusters
- Dynamic instrumentation to debug live Kafka services safely, without redeployments
- Fleet-wide dashboards, anomaly detection, and SLOs to prioritize issues that matter

Beyond tooling, we’ll share the operational patterns we rely on to keep Kafka stable at scale: detecting configuration drift across thousands of clusters, preventing cascading failures, and shifting from reactive firefighting to predictive capacity and risk management.
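To make the idea of partition-level hot-spot detection concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method presented in the talk: the function name `find_hot_partitions`, the input shape (a mapping from partition id to bytes-in per second), and the 3x-median threshold are all hypothetical choices made for this example.

```python
from statistics import median


def find_hot_partitions(bytes_in_per_sec, ratio_threshold=3.0):
    """Return the ids of partitions whose throughput is well above the topic median.

    bytes_in_per_sec: dict mapping partition id -> observed bytes-in per second
                      (hypothetical input shape, e.g. scraped from broker metrics).
    ratio_threshold:  how many times the median a partition must exceed to be
                      flagged (an assumed default, tune per workload).
    """
    if not bytes_in_per_sec:
        return []

    baseline = median(bytes_in_per_sec.values())

    if baseline <= 0:
        # At least half of the partitions are idle, so a ratio is meaningless;
        # flag any partition that carries traffic at all.
        return sorted(p for p, rate in bytes_in_per_sec.items() if rate > 0)

    # The median is robust to the outliers we are trying to find,
    # unlike the mean, which a single hot partition would inflate.
    return sorted(
        p for p, rate in bytes_in_per_sec.items()
        if rate > ratio_threshold * baseline
    )


if __name__ == "__main__":
    # Hypothetical sample: partition 3 receives far more traffic than its peers.
    sample = {0: 100, 1: 110, 2: 95, 3: 900, 4: 105}
    print(find_hot_partitions(sample))
```

A static ratio like this only catches skew that is already present; detecting emerging hot spots, as described above, would additionally require comparing each partition against its own recent history.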
This session is for engineers running Kafka in serious production environments who want to understand what it takes to operate streaming systems at global scale—and how modern observability enables reliability when failure is the default state.
Nandini Singhal
Datadog