From Batch to Real Time: Operating Cassandra CDC with Debezium at Datadog Scale

Breakout Session

At Datadog, the Metrics Query Activity feature relies on fast faceted search across operational data stored in Cassandra. The previous replication model used scheduled batch jobs that queried Cassandra by partition key and copied the data into Elasticsearch. This created heavy read pressure on production clusters, introduced operational complexity, and resulted in a four-hour delay before changes became visible downstream. The batch jobs ran on a fixed schedule and were enabled for only a limited subset of customers. With the Cassandra cluster sustaining more than 30,000 writes per second, extending this approach to the full customer base would have required a corresponding increase in job execution rate and query volume.

This talk presents how we replaced this batch approach with a real-time streaming architecture based on Cassandra CDC and the open-source Debezium Cassandra connector, including our upstream contributions to the project. By reading changes directly from Cassandra's commit logs and streaming them into Kafka, we removed the need for read-intensive extraction jobs. A downstream Kafka Connect Elasticsearch sink then applies updates as they arrive, keeping indexed documents aligned with the source of truth within seconds.

Supporting Datadog’s write volume required a CDC pipeline that could process more than 30,000 writes per second and recover cleanly from failures. We tuned Debezium’s Kafka producers and evaluated the system under peak load, verifying at-least-once delivery and clean recovery from connector issues so that downstream data remains eventually consistent.
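As an illustration of why at-least-once delivery is enough for eventual consistency, here is a minimal Python sketch, not the pipeline described in this talk. It assumes each change event carries a key and a write timestamp, and that the sink upserts by key and ignores stale or duplicate events. The names ChangeEvent, apply_events, and replay_after_restart are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ChangeEvent:
    key: str        # hypothetical primary key of the source row
    ts_micros: int  # hypothetical write timestamp of the change
    value: dict     # row contents after the change


def apply_events(index: Dict[str, ChangeEvent], events: Iterable[ChangeEvent]) -> None:
    """Upsert each event by key, keeping only the newest write per key."""
    for event in events:
        current = index.get(event.key)
        # Stale or duplicate events are ignored, so replays are harmless.
        if current is None or event.ts_micros > current.ts_micros:
            index[event.key] = event


def replay_after_restart(events: List[ChangeEvent], resume_from: int) -> List[ChangeEvent]:
    """Simulate a connector restart that re-sends events from an older position."""
    return events + events[resume_from:]


events = [
    ChangeEvent("q1", 1, {"status": "running"}),
    ChangeEvent("q2", 2, {"status": "running"}),
    ChangeEvent("q1", 3, {"status": "done"}),
]

# Index built from a stream with no duplicates.
clean_index: Dict[str, ChangeEvent] = {}
apply_events(clean_index, events)

# Index built from a stream where the last two events are delivered twice.
replayed_index: Dict[str, ChangeEvent] = {}
apply_events(replayed_index, replay_after_restart(events, resume_from=1))
```

In this sketch both indexes end in the same state, which is the property that lets a restarted connector re-send events without corrupting downstream documents.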

The impact is reflected in several key metrics. Replication delay fell from four hours to under ten seconds. Eliminating read-heavy extraction jobs removed pressure on Cassandra and created opportunities for future cluster downscaling. The new architecture also reduced operational cost by an estimated 46 percent while providing a streaming model that scales naturally with write throughput and isolates OLTP workloads from downstream processing.

Attendees will learn how to implement Cassandra CDC with Debezium in a high-volume environment, how to tune and scale Debezium and Kafka to handle demanding write workloads, how to migrate safely from batch replication to streaming, and the practical lessons we learned while operationalizing Cassandra CDC at Datadog scale.


Joan Gomez

Datadog

Alejandro Huertas

Datadog