We are running a 3 broker Kafka 0.10.0.1 cluster. We have a java app which spawns many consumer threads consuming from different topics. For every topic we have specified differ
Rebalance timeout is equal to max.poll.interval.ms
(5 minutes in your case) When rebalance starts in a group, Kafka revokes all the consumers in that group. Then waits for all alive consumers (consumers which send heartbeat) to poll() and send JoinGroupRequest.
This waiting process will end up with rebalance timeout or all the alive consumers poll() and Kafka assign partitions to these consumers.
So in your case you probably have a long running process in one of your consumers, and Kafka waits this process to complete to assign partitions.
For more information you can check these:
Consumer groups are an essential mechanism of Kafka. They allow consumers to share load and elastically scale by dynamically assigning the partitions of topics to consumers. In our current model of consumer groups, whenever a rebalance happens every consumer from that group experiences downtime - their poll() calls block until every other consumer in the group calls poll(). That is due to the fact that every consumer needs to call JoinGroup in a rebalance scenario in order to confirm it is still in the group.
Today, if the client has configured
max.poll.interval.ms
to a large value, the group coordinator broker will take in an unlimited number of join group requests and the rebalance could therefore continue for an unbounded amount of time. (https://cwiki.apache.org/confluence/display/KAFKA/KIP-389%3A+Introduce+a+configurable+consumer+group+size+limit)
-
Since we give the client as much as max.poll.interval.ms to handle a batch of records, this is also the maximum time before a consumer can be expected to rejoin the group in the worst case. We therefore propose to set the rebalance timeout in the Java client to the same value configured with max.poll.interval.ms. When a rebalance begins, the background thread will continue sending heartbeats. The consumer will not rejoin the group until processing completes and the user calls poll(). From the coordinator's perspective, the consumer will not be removed from the group until either 1) their session timeout expires without receiving a heartbeat, or 2) the rebalance timeout expires.
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread)