Frequent “offset out of range” messages, partitions deserted by consumer

Question

We are running a 3-node Kafka 0.10.0.1 cluster. We have a consumer application with a single consumer group that subscribes to multiple topics. We are seeing strange behaviour in the consumer logs, starting with these lines:

 Fetch offset 1109143 is out of range for partition email-4, resetting offset
 Fetch offset 952168 is out of range for partition email-7, resetting offset
 Fetch offset 945796 is out of range for partition email-5, resetting offset
 Fetch offset 950900 is out of range for partition email-0, resetting offset
 Fetch offset 953163 is out of range for partition email-3, resetting offset
 Fetch offset 1118389 is out of range for partition email-6, resetting offset
 Fetch offset 1112177 is out of range for partition email-2, resetting offset
 Fetch offset 1109539 is out of range for partition email-1, resetting offset

Some time later we saw these logs:

[2018-06-08 19:45:28] :: INFO  :: ConsumerCoordinator:333 - Revoking previously assigned partitions [sms-4, sms-3, sms-0, sms-2, sms-1] for group notifications-consumer
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator:381 - (Re-)joining group notifications-consumer
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO  :: ConsumerCoordinator:225 - Setting newly assigned partitions [sms-8, sms-7, sms-9, sms-6, sms-5] for group notifications-consumer

I noticed that one of our topics did not appear in the "Setting newly assigned partitions" list. That topic then had no consumers attached to it for at least 8 hours. It only started being consumed again when someone restarted the application. What could be going wrong here?

Here is the consumer config:

auto.commit.interval.ms = 3000
auto.offset.reset = latest
bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = otp-notifications-consumer
heartbeat.interval.ms = 3000
interceptor.classes = null
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 50
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.ms = 50
request.timeout.ms = 305000
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
session.timeout.ms = 300000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = /x/x/client.truststore.jks
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer

The topic that was orphaned has 10 partitions, with retention.ms=1800000 and segment.ms=1800000. Please help.

Answer 1:

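The "offset out of range" message you are seeing usually indicates that the offset the consumer is positioned at has already been deleted on the broker. When it hits that condition, the consumer falls back to auto.offset.reset to decide where to resume. With auto.offset.reset = latest, as in your config, it jumps to the end of the partition.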
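To make the reset behaviour concrete, here is a minimal sketch in plain Java of the decision described above. It is not the Kafka client's code: the class, the method `resolve`, and the broker offsets in `main` are hypothetical, and the real consumer raises its own exception type when auto.offset.reset is none.

```java
public class OffsetResetSketch {
    /** Mirrors the three accepted values of auto.offset.reset. */
    public enum ResetPolicy { EARLIEST, LATEST, NONE }

    /**
     * Returns the offset a consumer would continue from.
     * logStart is the oldest offset still present on the broker,
     * logEnd is the offset the next produced record will receive.
     */
    public static long resolve(long fetchOffset, long logStart, long logEnd, ResetPolicy policy) {
        boolean inRange = fetchOffset >= logStart && fetchOffset <= logEnd;
        if (inRange) {
            return fetchOffset; // nothing to reset
        }
        switch (policy) {
            case EARLIEST:
                return logStart; // re-read everything the broker still has
            case LATEST:
                return logEnd;   // skip ahead; records in between are never consumed
            default:
                // Stand-in for the error the real client reports when no policy is set.
                throw new IllegalStateException("offset " + fetchOffset + " is out of range");
        }
    }

    public static void main(String[] args) {
        // Assumed broker state: retention has removed everything below 1200000.
        long logStart = 1_200_000L;
        long logEnd = 1_250_000L;
        // 1109143 is the fetch offset from the first log line in the question.
        System.out.println(resolve(1_109_143L, logStart, logEnd, ResetPolicy.LATEST));
        System.out.println(resolve(1_109_143L, logStart, logEnd, ResetPolicy.EARLIEST));
    }
}
```

With the assumed offsets, the two calls in `main` return the end and the start of the log respectively, which shows why `latest` silently drops whatever was produced between the stale position and the reset.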

With retention.ms=1800000 (30 minutes), you are keeping data for only a very short time, so it is expected that if a consumer is restarted after several hours, the data at its last position is gone.
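The arithmetic behind that statement can be checked with a short sketch. The helper `pastRetention` is hypothetical and deliberately simplified: brokers enforce retention per log segment rather than per record, so actual deletion lags behind this per-record view.

```java
import java.time.Duration;

public class RetentionSketch {
    /** True if a record of the given age is older than the topic's time-based retention. */
    public static boolean pastRetention(Duration age, long retentionMs) {
        return age.toMillis() > retentionMs;
    }

    public static void main(String[] args) {
        long retentionMs = 1_800_000L; // retention.ms from the question

        // 1,800,000 ms expressed in minutes
        System.out.println(Duration.ofMillis(retentionMs).toMinutes()); // prints 30

        // The topic sat unconsumed for about 8 hours
        System.out.println(pastRetention(Duration.ofHours(8), retentionMs)); // prints true

        // A consumer that is only 10 minutes behind is still within retention
        System.out.println(pastRetention(Duration.ofMinutes(10), retentionMs)); // prints false
    }
}
```

Any position that falls more than about 30 minutes behind is therefore at risk of pointing at deleted data, which matches the out-of-range messages in the question.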

Source: https://stackoverflow.com/questions/50894710/frequent-offset-out-of-range-messages-partitions-deserted-by-consumer

Tags

java

apache-kafka