Kafka Connection Chaos: Surviving the Storm
Breakout Session
It is 9 AM, and the support team begins maintenance to renew the Kafka brokers' certificates. By 9:30 AM half of the cluster has been updated correctly, but the liveness probe metric looks unstable. We check connectivity; everything looks fine. Our monitoring stack can still consume from and produce to all brokers. Connection counts are a bit higher than usual, but still within limits. 9:40 AM: some teams start complaining that they can neither consume nor produce. What is happening? Then we discover the acceptor metric showing that brokers are blocking 80% of incoming connections. What is an acceptor, and why is it blocking our connections?
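For readers who want to look for the same signal on their own cluster: the "acceptor metric" referred to here is, as far as we can tell, the per-listener gauge Kafka exposes for the acceptor thread (the broker thread that accepts new TCP connections), reporting the fraction of time it spends blocked rather than accepting:

```
kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listenerName}
```

When this gauge climbs toward 100%, new connections queue up or are rejected even though already-established connections (such as a monitoring stack's long-lived clients) keep working, which matches the symptoms described above.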
The previous paragraph describes an incident in which our Kafka platform experienced a connection storm, leading to significant degradation. The event highlighted the crucial need for effective connection management and exposed gaps in our understanding of Kafka's connection handling, especially for new connections.
In this talk, we will share our journey and insights for platform teams maintaining Kafka. You'll learn how Kafka on Linux servers manages connections and the challenges you might encounter. We will dive into the metrics and mechanisms Kafka offers to detect and protect against connection storms. And last but not least, we'll share tips from our experience to help you avoid the mistakes we made.
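As a preview of the protection mechanisms the talk covers, Kafka's broker configuration includes limits on both the total number of connections and the rate at which new ones are accepted. A minimal sketch in `server.properties` (the values are illustrative, not recommendations, and must be tuned per cluster):

```properties
# Cap the total number of connections a broker will hold open.
max.connections=1000

# Cap connections from any single source IP.
max.connections.per.ip=50

# Throttle the rate of NEW connection attempts per broker,
# the limit most directly aimed at connection storms.
max.connection.creation.rate=30
```

Once a limit is reached, the broker delays or refuses further connection attempts instead of exhausting its acceptor and network threads, trading some client-side connection latency for overall broker stability.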
Javier Hortal
adidas
Rafael García Ortega
adidas