Migrating a Large-Scale Kafka Streams Platform to the KIP-1071 Rebalance Protocol

Breakout Session

For half a dozen years, we’ve been running Kafka Streams applications at bakdata, constantly balancing throughput, stability, and operational overhead. A recurring pain has been frequent or slow rebalances that negatively impact the latency and throughput of Streams apps. While rebalancing is necessary to optimally distribute tasks - such as when members join or leave a group - frequent rebalances can be disruptive. Incremental cooperative rebalancing (Kafka 2.4+) reduces partition movement and even keeps processing running during rebalancing (Kafka 2.5) but does not fully alleviate the pain.With growing experience, we optimized configs of our Streams apps for specific app characteristics to avoid frequent and slow rebalances that stall processing. Such characteristics include high-latency record processing, application statefulness, or dynamic horizontal scaling.The new Streams rebalance protocol introduced in KIP-1071 makes assignment of Streams tasks a first-class citizen in the Kafka protocol. It promises to make rebalances less disruptive to the processing.As part of our migration efforts to the new protocol, we adjusted our established Streams configs where necessary. With our insights, we hope to help others migrate to the new Streams rebalance protocol once it becomes generally available in an upcoming Kafka release.For the classic rebalance protocol, we used to tune configs on a per-app basis to prevent undesirable rebalances. Depending on the Streams app’s characteristics, we typically made some of the following adjustments:

Increase group.initial.rebalance.delay.ms on the broker to give members more time to join the consumer group during “scale-from-0” of a Streams app.

Increase max.poll.interval.ms and decrease max.poll.records and/or max.partition.fetch.bytes to account for high-latency processing of individual records.

Set group.instance.id - not just for stateful apps - to mitigate the impact of member restarts.

Increase transaction.timeout.ms in exactly-once scenarios.

In our session, we aim to answer what the new Streams rebalance protocol means for a mature Kafka Streams platform.

Using the new protocol, what Streams app configs need tweaking? Do some configs’ defaults suffice now where we previously needed to tailor configs to an app’s characteristics?

What changes for the observability of Streams apps?

How does rebalancing behavior differ in terms of frequency and rebalance duration?

What impact does this have on latency and throughput of real-world Streams apps?

Join us to ready yourself for the new Streams rebalance protocol!


Jakob Edding

bakdata