Ensuring Client Continuity in Kafka: High Availability in Confluent Kafka

Lightning Talk

Managing large-scale Kafka clusters is both a technical challenge and an art. At Trendyol, our Data Streaming team operates Kafka as the backbone of a vast event-driven ecosystem, ensuring stability and seamless client experiences. However, we faced recurring issues during broker restarts: applications experienced connectivity errors caused by misconfigured topics and improper bootstrap server configurations.
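As an illustration of the bootstrap server problem, the sketch below flags a client `bootstrap.servers` value that names only one distinct host, which can leave a restarting client unable to connect while that broker is down. The function names are hypothetical and not part of any Kafka client library, and the check is a heuristic: a single DNS name that resolves to several brokers would be flagged even though it may be safe.

```python
from __future__ import annotations


def bootstrap_hosts(bootstrap_servers: str) -> set[str]:
    """Return the distinct host names in a comma-separated bootstrap.servers value."""
    hosts = set()
    for entry in bootstrap_servers.split(","):
        entry = entry.strip()
        if entry:
            # Drop the trailing ":port" if present; keep the host part only.
            hosts.add(entry.rsplit(":", 1)[0].lower())
    return hosts


def has_single_bootstrap_host(bootstrap_servers: str) -> bool:
    """True when the client depends on fewer than two distinct bootstrap hosts."""
    return len(bootstrap_hosts(bootstrap_servers)) < 2
```

A value such as `kafka-1:9092,kafka-2:9092` passes this check, while `kafka-1:9092` alone does not.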

To address this, we leveraged Confluent Stretch Kafka across multiple data centers, enabling automatic leader elections without service disruptions. Additionally, we enforced topic creation and alter policies and built a custom Prometheus exporter to detect misconfigured topics in real time, allowing us to notify owners and take corrective actions proactively.
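The kind of rule such a policy or exporter might evaluate can be sketched as follows. This is a minimal illustration, not the team's actual implementation: the `TopicConfig` type, the `restart_risks` function, and the specific thresholds are assumptions. It relies on two general Kafka behaviors: a partition with a single replica is unavailable while its broker is down, and producers using `acks=all` are rejected when the number of in-sync replicas drops below `min.insync.replicas`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TopicConfig:
    # Hypothetical container for the settings the check needs.
    name: str
    replication_factor: int
    min_insync_replicas: int


def restart_risks(topic: TopicConfig) -> list[str]:
    """List the reasons a topic may disrupt clients when one broker restarts."""
    problems = []
    if topic.replication_factor < 2:
        problems.append(
            "replication factor below 2: partitions go offline "
            "while their only broker restarts"
        )
    if topic.min_insync_replicas >= topic.replication_factor:
        problems.append(
            "min.insync.replicas is not below the replication factor: "
            "acks=all producers fail when one replica is down"
        )
    return problems
```

A topic with a replication factor of 3 and `min.insync.replicas` of 2 yields an empty list, while a replication factor of 3 with `min.insync.replicas` of 3 is flagged because losing any one replica blocks `acks=all` writes.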

Through rigorous alerting mechanisms and enforcement via our Internal Development Platform (IDP), we have successfully eliminated disruptions during broker restarts, enabling smooth cluster upgrades and chaos testing. This session will provide practical insights into architecting resilient Kafka deployments, enforcing best practices, and ensuring high availability in a production environment handling thousands of clients.

Attendees will learn:

How multi-DC Kafka clusters ensure client continuity

The impact of misconfigured replication factors and how to prevent such misconfigurations

How real-time monitoring and alerts reduce operational risks

Practical strategies to enforce resilient topic configurations


Yalın Doğu Şahin

Trendyol

Mehmetcan Güleşçi

DSM GRUP DANIŞMANLIK İLETİŞİM VE SATIŞ TİCARET ANONİM ŞİRKETİ