Breaking Kafka at Scale: Lessons from Running 70K Topics on a Single Cluster

Breakout Session

Breaking Kafka isn’t that hard: deploying 70K topics on a single cluster will certainly do the trick. High availability quickly triples the blast radius, pushing the cluster past the 200K-partition stability threshold. At this scale, the cluster becomes fragile, and keeping production alive feels more like firefighting than engineering.

In this session, we’ll share our real-world Kafka journey: a technical migration from an aging, single-tenant architecture to a massively scaled, multi-tenant platform. We’ll detail how we engineered this platform to handle billions of events per day, power a super-fast UI, and maintain real-time replication underneath.

We’ll dive into the internals of our overwhelmed Kafka cluster and show how we used Kafka Connect and Debezium, running on Kubernetes, to replicate customer data from MySQL to SingleStore in under 10 seconds. Finally, we’ll share the concrete, quantifiable outcomes: an 80% reduction in Kafka infrastructure costs and the elimination of entire classes of stability issues.

This talk is packed with practical lessons, architectural trade-offs, and hard-earned insights. It is ideal for intermediate to senior data engineers, architects, and teams operating Kafka at scale (on-prem or cloud) who face cost, performance, or stability challenges.

Ziv Fridfertig

Skai