Kafka consumer, very long rebalances

北恋 2021-02-20 11:55

We are running a 3-broker Kafka 0.10.0.1 cluster. We have a Java app that spawns many consumer threads consuming from different topics. For every topic we have specified differ

4 answers
  • 2021-02-20 12:11

    Your consumer configuration seems reasonable. I would advise trying three things:

    • Try to spawn a single consumer thread, and assign it only one of the topics you're trying to consume from. That single thread should get all partitions for that topic assigned, and it should immediately start receiving all the data. You can try to print out the partition and message offset, as well as content, to validate that it's getting all the data.
    • Once you validate that's working, spawn a single consumer thread, and assign it all the topics you're trying to consume from. Do the same validation printing out the messages.
    • Finally, if that is working fine, start adding consumer threads one by one, and see if you start getting pauses when consuming.

    That should allow you to pinpoint where the problem is. If you're able to consume everything with a single thread, but not with multiple threads, then your threading mechanism/pooling might have issues.

  • 2021-02-20 12:20

    Check the size of the __consumer_offsets partitions on disk. We faced a similar issue that was caused by compaction errors, which lead to very long rebalances. See https://issues.apache.org/jira/browse/KAFKA-5413 for more details (fixed in Kafka 0.10.2.2 / 0.11). Another possibility is that your broker configuration is incorrect and compaction is turned off, i.e. log.cleaner.enable is false. __consumer_offsets is a compacted topic, so if the log cleaner is disabled, it will not be compacted and will lead to the same symptom.
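
    As a rough illustration of the configuration check, the sketch below parses the text of a broker properties file and reports whether log.cleaner.enable is on. The class and method names are hypothetical, the sample file content is made up, and treating a missing key as "true" is an assumption about the broker default that should be verified for your version.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Kafka: inspects broker properties text.
final class LogCleanerCheck {

    // Returns true when log.cleaner.enable is "true" or absent.
    // Assumption: an absent key means the broker default applies, taken here to be "true".
    static boolean isLogCleanerEnabled(String serverProperties) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(serverProperties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("log.cleaner.enable", "true").trim();
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        // Made-up sample content standing in for a real server.properties file.
        String sample = "broker.id=1\nlog.cleaner.enable=false\n";
        System.out.println("log cleaner enabled: " + isLogCleanerEnabled(sample));
    }
}
```

    If this reports false on any broker, __consumer_offsets is not being compacted there, which matches the second cause described above.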

  • 2021-02-20 12:21

    The rebalance timeout is equal to max.poll.interval.ms (5 minutes in your case). When a rebalance starts in a group, Kafka revokes the partitions of all the consumers in that group. It then waits for all live consumers (those still sending heartbeats) to call poll() and send a JoinGroupRequest.

    This wait ends either when the rebalance timeout expires or when all live consumers have called poll(). In either case, Kafka then assigns partitions to the consumers that have rejoined.

    So in your case you probably have a long-running process in one of your consumers, and Kafka waits for that process to complete before assigning partitions.
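
    The waiting rule above can be sketched as a small model. This is a simplification, not Kafka's coordinator code: the class, the method and the sample timings are hypothetical, and it assumes every consumer in the array stays alive (keeps sending heartbeats) for the whole rebalance.

```java
import java.util.Arrays;

// Simplified model of how long a rebalance blocks the group.
final class RebalanceWaitModel {

    // remainingProcessingMs[i] is the time until live consumer i next calls poll().
    // The rebalance completes when the slowest consumer rejoins, or when the
    // rebalance timeout (max.poll.interval.ms) expires, whichever comes first.
    static long rebalanceWaitMs(long[] remainingProcessingMs, long rebalanceTimeoutMs) {
        long slowest = Arrays.stream(remainingProcessingMs).max().orElse(0L);
        return Math.min(slowest, rebalanceTimeoutMs);
    }

    public static void main(String[] args) {
        long timeoutMs = 300_000L; // 5 minutes, as in the question

        // All consumers return to poll() quickly: the rebalance is short.
        System.out.println(rebalanceWaitMs(new long[] {200L, 500L, 800L}, timeoutMs));

        // One consumer is stuck in a 10-minute job: the group waits the full timeout.
        System.out.println(rebalanceWaitMs(new long[] {200L, 600_000L}, timeoutMs));
    }
}
```

    In this model a single slow consumer is enough to stall every other member for up to max.poll.interval.ms, which is why shortening per-batch processing time (or lowering the timeout) shortens the rebalance.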

    For more information you can check these:

    Consumer groups are an essential mechanism of Kafka. They allow consumers to share load and elastically scale by dynamically assigning the partitions of topics to consumers. In our current model of consumer groups, whenever a rebalance happens every consumer from that group experiences downtime - their poll() calls block until every other consumer in the group calls poll(). That is due to the fact that every consumer needs to call JoinGroup in a rebalance scenario in order to confirm it is still in the group.

    Today, if the client has configured max.poll.interval.ms to a large value, the group coordinator broker will take in an unlimited number of join group requests and the rebalance could therefore continue for an unbounded amount of time. (https://cwiki.apache.org/confluence/display/KAFKA/KIP-389%3A+Introduce+a+configurable+consumer+group+size+limit)

    -

    Since we give the client as much as max.poll.interval.ms to handle a batch of records, this is also the maximum time before a consumer can be expected to rejoin the group in the worst case. We therefore propose to set the rebalance timeout in the Java client to the same value configured with max.poll.interval.ms. When a rebalance begins, the background thread will continue sending heartbeats. The consumer will not rejoin the group until processing completes and the user calls poll(). From the coordinator's perspective, the consumer will not be removed from the group until either 1) their session timeout expires without receiving a heartbeat, or 2) the rebalance timeout expires.

    (https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread)

  • 2021-02-20 12:26

    I suspect your cluster version is at least 0.10.1.0, since your consumer configuration includes max.poll.interval.ms, which was introduced in that version.

    Kafka 0.10.1.0 integrates KIP-62, which introduces a rebalance timeout equal to max.poll.interval.ms; its default value is 5 minutes.

    I guess that if you don't want to wait for the timeout to expire during a rebalance, your consumers need to leave the consumer group cleanly by calling the close() method.
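
    One way to make sure close() always runs is to wrap the poll loop in try/finally. The sketch below shows only that shape and does not use the Kafka client: ConsumerWorker is a hypothetical class, the AutoCloseable stands in for the real consumer, and pollOnce stands in for one poll-and-process iteration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical worker: runs a poll loop and always closes the consumer on exit.
final class ConsumerWorker implements Runnable {
    private final AutoCloseable consumer;  // stand-in for the real consumer object
    private final Runnable pollOnce;       // stand-in for one poll-and-process step
    private final AtomicBoolean running = new AtomicBoolean(true);

    ConsumerWorker(AutoCloseable consumer, Runnable pollOnce) {
        this.consumer = consumer;
        this.pollOnce = pollOnce;
    }

    // Safe to call from another thread; the loop exits after the current iteration.
    void shutdown() {
        running.set(false);
    }

    @Override
    public void run() {
        try {
            while (running.get()) {
                pollOnce.run();
            }
        } finally {
            // Runs on normal shutdown and when pollOnce throws, so the
            // consumer leaves the group instead of timing out.
            try {
                consumer.close();
            } catch (Exception e) {
                System.err.println("close failed: " + e);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConsumerWorker worker = new ConsumerWorker(
                () -> System.out.println("consumer closed"),
                () -> {
                    try {
                        Thread.sleep(50); // pretend to poll and process
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
        Thread thread = new Thread(worker);
        thread.start();
        Thread.sleep(200);
        worker.shutdown();
        thread.join();
    }
}
```

    With the real client, a poll() that is blocked can be interrupted from another thread with wakeup(); the flag-based shutdown here only takes effect between iterations.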
