Life as a Kafka Admin: Lessons from Running 30+ Clusters in Production

Breakout Session

Operating Apache Kafka in production is very different from “using Kafka as a developer”. Since 2021, I’ve worked as a Kafka Admin responsible for more than 50 clusters across multiple regions, helping dozens of teams build on top of Kafka while keeping the platform stable and predictable. Over time, the same patterns repeat: too many or too few partitions, services calling slow external APIs in the middle of stream processing, painful rebalances, clients that “cannot connect”, and users who just want the platform to “work” without learning all the internals.

This talk shares the practical lessons learned from living in that world every day. It covers how to design and review topics and partitioning, how to deal with rebalances and skew, how to debug connection and authentication issues at scale, and how to build automations and guardrails that improve the developer experience for many teams at once. It also highlights what changes when you manage many clusters in different environments and regions, and how to keep your sanity while doing it.


Marcos Prado

SRE Engineer