Understanding Kafka poll(), flush() & commit()


Question


I'm new to Kafka and am trying out a few small use cases for my new application. The use case is basically: Kafka producer —> Kafka consumer —> Flume Kafka source —> Flume HDFS sink.

When consuming (step 2), the sequence of steps is:

1. consumer.poll(1.0)
   a. Produce to multiple topics (multiple Flume agents are listening)
   b. producer.poll()
2. flush() every 25 messages
3. commit() after every message (asynchCommit=false)

Question 1: Is this sequence of actions right?

Question 2: Will this cause any data loss, given that flush() happens every 25 messages but commit() happens for every message?

Question 3: What is the difference between poll() for the producer and poll() for the consumer?

Question 4: What happens when messages are committed but not flushed?

I would really appreciate it if someone could help me understand, with offset examples between producer and consumer, how poll, flush, and commit relate.

Thanks in advance!!


Answer 1:


Let us first get a quick understanding of Kafka:

What is a Kafka producer:

t.turner@devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-producer.sh --broker-list 100.102.1.40:9092,100.102.1.41:9092 --topic company_wallet_db_v3-V3_0_0-transactions
>{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
[2019-07-21 11:53:37,907] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 7 : {company_wallet_db_v3-V3_0_0-transactions=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)

You can ignore the warning. It appears because Kafka could not find the topic; it then auto-creates the topic.
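
Since the question is also about producer flush(), here is a minimal Java sketch of the same produce-then-flush flow (the broker list and topic are taken from the console example above; the payload and class name are illustrative, not the asker's actual code):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // broker list from the console example above
        props.put("bootstrap.servers", "100.102.1.40:9092,100.102.1.41:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: the record only enters an in-memory buffer here
            producer.send(new ProducerRecord<>("company_wallet_db_v3-V3_0_0-transactions", "{\"amount\":40.0}"));
            // flush() blocks until every buffered record has been sent to the broker
            producer.flush();
        }
    }
}

Note that send() only buffers the record in memory; flush() (or close()) is what forces the buffer out to the broker.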

Let us see how Kafka has stored this message:

When the topic is created, Kafka creates a directory per partition on the broker server under /kafka-logs (for Apache Kafka) or /kafka-cf-data (for the Confluent version):

drwxr-xr-x   2 root root  4096 Jul 21 08:53 company_wallet_db_v3-V3_0_0-transactions-0

cd into this directory and then list the files. You will see the .log file that stores the actual data:

-rw-r--r--   1 root root 10485756 Jul 21 08:53 00000000000000000000.timeindex
-rw-r--r--   1 root root 10485760 Jul 21 08:53 00000000000000000000.index
-rw-r--r--   1 root root        8 Jul 21 08:53 leader-epoch-checkpoint
drwxr-xr-x   2 root root     4096 Jul 21 08:53 .
-rw-r--r--   1 root root      762 Jul 21 08:53 00000000000000000000.log

If you open the log file, you will see:

^@^@^@^@^@^@^@^@^@^@^Bî^@^@^@^@^B<96>T<88>ò^@^@^@^@^@^@^@^@^Al^S<85><98>k^@^@^Al^S<85><98>kÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^@^@^@^Aö
^@^@^@^Aè
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}^@

Let us understand how the consumer polls and reads records:

What is Kafka Poll:

Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5.

There are actually two notions of position relevant to the user of the consumer. The position of the consumer gives the offset of the next record that will be given out; it is one larger than the highest offset the consumer has seen in that partition, and it advances automatically every time the consumer receives messages in a call to poll(long). (The second notion, the committed position, is covered under "What is Kafka Commit" below.)

So, poll() takes a duration as input, waits up to that duration for records from the partition log (the 00000000000000000000.log file above), and returns them to the consumer.
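
As an illustration, here is a minimal Java consumer sketch built around poll() (the group id and timeout are hypothetical; the broker and topic come from the examples in this answer):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "100.102.1.40:9092");
        props.put("group.id", "demo-group");        // hypothetical group id
        props.put("auto.offset.reset", "earliest"); // same effect as --from-beginning
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("company_wallet_db_v3-V3_0_0-transactions"));
            while (true) {
                // wait up to 1 second for records; the consumer's position
                // advances past whatever this call returns
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}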

When are messages removed:

Kafka takes care of removing old messages by itself. There are two retention policies (see the config sketch after this list):

  1. Time-based: the default is 7 days. It can be altered using log.retention.ms=1680000
  2. Size-based: it can be set like log.retention.bytes=10487500
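
Both settings are broker-side, in server.properties. A sketch using the exact values quoted above (note that log.retention.bytes applies per partition):

# server.properties (broker-side)
# time-based retention: delete log segments older than this many ms (default is 7 days)
log.retention.ms=1680000
# size-based retention: per-partition cap before old segments are deleted
log.retention.bytes=10487500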

Now let us look at the consumer:

t.turner@devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-consumer.sh --bootstrap-server 100.102.1.40:9092 --topic company_wallet_db_v3-V3_0_0-transactions --from-beginning
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
^CProcessed a total of 1 messages

The above command instructs the consumer to read from offset = 0. Kafka assigns this console consumer a group_id and maintains the last offset that this group_id has read, so it can push newer messages to this consumer group.
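
If you want to inspect that stored offset programmatically, here is a hedged sketch (the group id is hypothetical; committed() returns the last committed offset for the group on a partition, or null if there is none):

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommittedOffsetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "100.102.1.40:9092");
        props.put("group.id", "demo-group");   // hypothetical: whichever group you want to inspect
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("company_wallet_db_v3-V3_0_0-transactions", 0);
            // the last committed offset for this group on partition 0, or null if none
            OffsetAndMetadata committed = consumer.committed(tp);
            System.out.println("committed: " + committed);
        }
    }
}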

What is Kafka Commit:

Commit is a way to tell Kafka which messages the consumer has successfully processed. It can be thought of as updating a lookup from group-id to current_offset + 1. You can manage this using the commitAsync() or commitSync() methods of the consumer object.
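
A minimal sketch of a manual-commit loop, assuming the consumer was created with enable.auto.commit=false (process() is a hypothetical stand-in for your own logic):

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitSketch {
    static void process(ConsumerRecord<String, String> record) {
        // hypothetical processing step
        System.out.println("processing offset " + record.offset());
    }

    static void pollAndCommit(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            process(record);
        }
        // synchronous: blocks until the broker has stored group-id -> last offset + 1
        consumer.commitSync();
        // asynchronous alternative: returns immediately, failures go to the callback
        // consumer.commitAsync((offsets, exception) -> {
        //     if (exception != null) System.err.println("commit failed: " + exception);
        // });
    }
}

commitSync() blocks and retries until the commit succeeds or fails unrecoverably; commitAsync() returns immediately and does not retry, so it is commonly paired with a final commitSync() on shutdown.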

Reference: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html



Source: https://stackoverflow.com/questions/57027510/understanding-kafka-poll-flush-commit
