Question
I'm new to Kafka and trying out a few small use cases for my new application. The use case is basically: Kafka producer -> Kafka consumer -> Flume Kafka source -> Flume HDFS sink.
When consuming (step 2), this is the sequence of steps:
1. consumer.poll(1.0)
1.a. Produce to multiple topics (multiple Flume agents are listening)
1.b. producer.poll()
2. flush() every 25 msgs
3. commit() for every msg (asynchCommit=false)
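Roughly, in code, the loop looks like the sketch below (using the Python confluent-kafka client; the broker address, group id, and topic names are placeholders, not my real config):

# Sketch of the consume -> produce -> flush -> commit loop described above.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "my-consumer-group",         # placeholder group id
    "enable.auto.commit": False,             # commit manually, per message
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["source-topic"])         # placeholder topic

count = 0
while True:
    msg = consumer.poll(1.0)                 # 1. consumer poll, 1 second timeout
    if msg is None or msg.error():
        continue
    for topic in ["flume-topic-1", "flume-topic-2"]:
        producer.produce(topic, msg.value()) # 1.a. produce to multiple topics
    producer.poll(0)                         # 1.b. serve delivery callbacks
    count += 1
    if count % 25 == 0:
        producer.flush()                     # 2. flush every 25 msgs
    consumer.commit(asynchronous=False)      # 3. synchronous commit, every msg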
Question 1: Is this sequence of actions right?
Question 2: Will this cause any data loss, given that the flush happens every 25 msgs but the commit happens for every msg?
Question 3: What is the difference between poll() for the producer and poll() for the consumer?
Question 4: What happens when messages are committed but not flushed?
I would really appreciate it if someone could help me understand, with offset examples between producer and consumer, how poll, flush, and commit interact.
Thanks in advance!!
Answer 1:
Let us first understand Kafka in brief:
What is a Kafka producer:
t.turner@devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-producer.sh --broker-list 100.102.1.40:9092,100.102.1.41:9092 --topic company_wallet_db_v3-V3_0_0-transactions
>{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
[2019-07-21 11:53:37,907] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 7 : {company_wallet_db_v3-V3_0_0-transactions=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
You can ignore the warning. It appears because Kafka could not find the topic and auto-created it.
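The console producer is just a quick way to exercise the same API a producer application would use. A minimal sketch with the Python confluent-kafka client (broker list and topic taken from the console example above; the payload is shortened):

# Programmatic equivalent of the console producer above.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "100.102.1.40:9092,100.102.1.41:9092"})

def on_delivery(err, msg):
    # Invoked from poll()/flush() once the broker acks (or rejects) the message.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

producer.produce(
    "company_wallet_db_v3-V3_0_0-transactions",
    value='{"created_at":1563415200000}',  # shortened stand-in for the JSON above
    callback=on_delivery,
)
producer.flush()  # block until every buffered message is delivered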
Let us see how Kafka has stored this message:
The broker creates a directory on the broker server at /kafka-logs (for Apache Kafka) or /kafka-cf-data (for the Confluent version):
drwxr-xr-x 2 root root 4096 Jul 21 08:53 company_wallet_db_v3-V3_0_0-transactions-0
cd into this directory and then list the files. You will see the .log file that stores the actual data:
-rw-r--r-- 1 root root 10485756 Jul 21 08:53 00000000000000000000.timeindex
-rw-r--r-- 1 root root 10485760 Jul 21 08:53 00000000000000000000.index
-rw-r--r-- 1 root root 8 Jul 21 08:53 leader-epoch-checkpoint
drwxr-xr-x 2 root root 4096 Jul 21 08:53 .
-rw-r--r-- 1 root root 762 Jul 21 08:53 00000000000000000000.log
If you open the log file, you will see:
^@^@^@^@^@^@^@^@^@^@^Bî^@^@^@^@^B<96>T<88>ò^@^@^@^@^@^@^@^@^Al^S<85><98>k^@^@^Al^S<85><98>kÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^@^@^@^Aö
^@^@^@^Aè
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}^@
Let us understand how the consumer would poll and read records:
What is Kafka Poll:
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer: The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long). The committed position is the last offset that has been stored securely; should the process fail and restart, this is the offset that the consumer will recover to.
So, poll takes a duration as input, reads records from the 00000000000000000000.log file for up to that duration, and returns them to the consumer.
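As a sketch, a poll loop with the Python confluent-kafka client looks like this (broker, group id, and topic taken from the examples here; the auto.offset.reset setting is an assumption so that a fresh group starts at offset 0):

# Minimal poll loop: poll() waits up to the given timeout, returns one
# message (or None), and advances the consumer's position automatically.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "100.102.1.40:9092",
    "group.id": "example-group",
    "auto.offset.reset": "earliest",  # start at offset 0 when no committed offset exists
})
consumer.subscribe(["company_wallet_db_v3-V3_0_0-transactions"])

while True:
    msg = consumer.poll(1.0)      # wait up to 1 second for a record
    if msg is None:
        continue                  # nothing arrived within the timeout
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue
    print(f"offset={msg.offset()} value={msg.value().decode('utf-8')}")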
When are messages removed:
Kafka itself takes care of removing messages. There are two ways:
- Time-based: the default is 7 days. It can be altered with log.retention.ms=1680000
- Size-based: it can be set with log.retention.bytes=10487500
Now let us look at the consumer:
t.turner@devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-consumer.sh --bootstrap-server 100.102.1.40:9092 --topic company_wallet_db_v3-V3_0_0-transactions --from-beginning
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
^CProcessed a total of 1 messages
The above command instructs the consumer to read from offset = 0. Kafka assigns this console consumer a group_id and maintains the last offset that this group_id has read, so it can serve newer messages to this consumer group.
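To do the same programmatically, regardless of what the group has already committed, you can assign the partition at OFFSET_BEGINNING. A sketch with the Python client (partition 0 assumed, matching the single-partition example above):

# Re-read a partition from offset 0, like --from-beginning does.
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "100.102.1.40:9092",
    "group.id": "example-group",
})
consumer.assign([
    TopicPartition("company_wallet_db_v3-V3_0_0-transactions", 0, OFFSET_BEGINNING)
])
msg = consumer.poll(10.0)
if msg is not None and not msg.error():
    print(msg.offset(), msg.value())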
What is Kafka Commit:
Commit is a way to tell Kafka which messages the consumer has successfully processed. This can be thought of as updating the lookup between group-id : current_offset + 1.
You can manage this using the commitAsync() or commitSync() methods of the consumer object.
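In the Python confluent-kafka client (which the question's poll(1.0) and flush() calls suggest), both variants map onto a single commit() call with an asynchronous flag. A sketch of committing current_offset + 1 explicitly (the helper name commit_processed is mine):

# Commit "offset of the processed message + 1" back to Kafka.
from confluent_kafka import Consumer, TopicPartition

def commit_processed(consumer: Consumer, msg) -> None:
    consumer.commit(
        offsets=[TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
        asynchronous=False,  # synchronous, like the Java commitSync()
    )
    # Equivalent shortcut: consumer.commit(message=msg, asynchronous=False)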
Reference: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Source: https://stackoverflow.com/questions/57027510/understanding-kafka-poll-flush-commit