Question
I was going through the documentation, and what I understood is that we can achieve an exactly-once transaction by enabling idempotence=true
idempotence: The idempotent producer enables exactly-once for a producer against a single topic. Basically, each message send has stronger guarantees and will not be duplicated in case there's an error.
So if we already have idempotence, why do we need another exactly-once property in Kafka Streams? What exactly is the difference between idempotence and exactly-once?
Why is the exactly-once property not available in the normal Kafka Producer?
Answer 1:
Kafka Streams offers exactly-once semantics from an end-to-end point of view (it consumes from one topic, processes that message, then produces to another topic). However, you mentioned only the producer's idempotent attribute. That is only a small part of the full picture.
Let me rephrase the question:
Why do we need the exactly-once delivery semantic on the consumer side when we have already guaranteed it on the producer side?
Answer: Because the exactly-once delivery semantic covers not only the producing step but the full processing flow. To achieve exactly-once delivery, some conditions must be satisfied on both the producing and the consuming side.
This is the generic scenario: process A produces messages to topic T. At the same time, process B tries to consume messages from topic T. We want to make sure process B never processes one message twice.
Producer part: We must make sure that producers never produce a message twice. We can use the Kafka idempotent producer.
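As a rough sketch, enabling the idempotent producer with the plain Java client looks like this (the broker address and the topic name "T" are placeholders for this example):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // With idempotence enabled the broker de-duplicates retried batches,
        // so producer retries cannot introduce duplicates within a partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required when idempotence is enabled

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("T", "key", "value"));
        }
    }
}
```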
Consumer part: Here is the basic workflow for the consumer (sketched in code after the list):
- Step 1: The consumer pulls the message M successfully from the Kafka topic.
- Step 2: The consumer tries to execute the job and the job returns successfully.
- Step 3: The consumer commits the message's offset to the Kafka brokers.
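The same three steps, sketched with the plain Java consumer (the topic "T", the group id and the process() helper are hypothetical; only the poll/process/commit shape matters here):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerWorkflowSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "process-b");               // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit manually in step 3

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("T"));
            while (true) {
                // Step 1: pull messages
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Step 2: execute the job
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical business logic
                }
                // Step 3: commit the offsets; a crash between step 2 and step 3
                // is exactly the window where reprocessing (duplicates) can happen
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the actual job
    }
}
```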
The steps above are just the happy path. Many issues can arise in reality.
- Scenario 1: The job in step 2 executes successfully, but then the consumer crashes. Because of this unexpected circumstance, the consumer has not committed the message's offset yet. When the consumer restarts, the message will be consumed twice.
- Scenario 2: While the consumer commits the offset in step 3, it crashes due to a hardware failure (e.g. CPU, memory violation, ...). When it restarts, the consumer has no way to know whether it committed the offset successfully or not.
Because so many problems can happen, the job's execution and the offset commit must be atomic to guarantee the exactly-once delivery semantic on the consumer side. It is not impossible to do this yourself, but it takes a lot of effort to get it right. Kafka Streams takes this work off the engineers' shoulders.
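This atomic coupling of processing output and offset commit is what Kafka's transactional API (and, under the hood, Kafka Streams) provides. A simplified sketch, assuming a producer configured with a transactional.id and a consumer with auto-commit disabled; the topic name "output-topic" is a placeholder:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class AtomicProcessAndCommitSketch {
    // Configuration is omitted: the producer must set transactional.id,
    // and the consumer must set enable.auto.commit=false.
    static void runLoop(KafkaConsumer<String, String> consumer,
                        KafkaProducer<String, String> producer) {
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) {
                continue;
            }
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    // produce the result of the job to the output topic
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // The consumed offsets are committed inside the same transaction as the
                // output records, so "job executed" and "offset committed" become one atomic unit.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();
            }
        }
    }
}
```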
Note that Kafka Streams offers "exactly-once stream processing". It refers to consuming from a topic, materializing intermediate state in a Kafka topic and producing to another one. If our application depends on other external services (a database, other services, ...), we must make sure those external dependencies can guarantee exactly-once in such cases.
TL;DR: exactly-once for the full flow needs cooperation between producers and consumers.
References:
- Exactly-once semantics and how Apache Kafka does it
- Transactions in Apache Kafka
- Enabling exactly once Kafka streams
Answer 2:
In a distributed environment, failure is a very common scenario that can happen at any time. In a Kafka environment, the broker can crash, and there can be network failures, failures in processing, failures while publishing a message, or failures to consume messages, etc. These different scenarios introduce different kinds of data loss and duplication.
Failure scenarios
A (Ack Failed): The producer published a message successfully with retries>1 but could not receive the acknowledgement due to a failure. In that case the producer will retry the same message, which might introduce a duplicate.
B (Producer process failed in batch messages): The producer was sending a batch of messages and failed with a few already published successfully. In that case, once the producer restarts, it will republish all messages from the batch, which will introduce duplicates in Kafka.
C (Fire & Forget Failed): The producer published a message with retries=0 (fire and forget). In case of failure, the producer will not be aware of it and will send the next message; this causes message loss.
D (Consumer failed in batch message): A consumer receives a batch of messages from Kafka and manually commits their offsets (enable.auto.commit=false). If the consumer fails before committing to Kafka, the next time it will consume the same records again, which reproduces duplicates on the consumer side.
Exactly-Once semantics
In this case, even if a producer tries to resend a message, the message will be published and consumed by the consumer exactly once.
To achieve exactly-once semantics, Kafka uses the 3 properties below (a config sketch follows the list):
- enable.idempotence=true (addresses A, B & C)
- MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5 (with idempotence enabled, the producer may have at most 5 in-flight requests per connection)
- isolation.level=read_committed (addresses D)
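A sketch of where these settings live in the Java client configs (bootstrap servers and serializers are omitted; the transactional.id value is a placeholder and is only needed if you also use transactions):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ExactlyOnceConfigSketch {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);                // addresses A, B & C
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);       // must be <= 5 with idempotence
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id"); // placeholder, enables transactions
        return props;
    }

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");       // addresses D
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        return props;
    }
}
```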
Enable idempotence (enable.idempotence=true)
Idempotent delivery enables the producer to write a message to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer, without data loss, preserving order per partition.
"Note that enabling idempotence requires MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION to be less than or equal to 5, RETRIES_CONFIG to be greater than 0 and ACKS_CONFIG be 'all'. If these values are not explicitly set by the user, suitable values will be chosen. If incompatible values are set, a ConfigException will be thrown"
To achieve idempotence, Kafka uses a unique id called the producer id (PID) together with a sequence number while producing messages. The producer keeps incrementing the sequence number for each message published, mapped to its unique PID. The broker always compares the current sequence number with the previous one and rejects the message if the new one is not exactly +1 greater than the previous one, which avoids duplication; if it is greater by more than +1, that indicates lost messages.
In a failure scenario, the broker compares the sequence number with the previous one, and if the sequence has not increased by exactly +1, it rejects the message.
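A much simplified illustration of that broker-side check (this is not the actual broker code, only the idea of comparing sequence numbers per PID):

```java
import java.util.HashMap;
import java.util.Map;

public class SequenceCheckSketch {
    private final Map<Long, Integer> lastSequenceByPid = new HashMap<>();

    boolean accept(long producerId, int sequence) {
        Integer last = lastSequenceByPid.get(producerId);
        int expected = (last == null) ? 0 : last + 1;
        if (sequence < expected) {
            return false;  // already written before: a retried duplicate, so reject it
        }
        if (sequence > expected) {
            // a gap of more than +1 means messages were lost in between
            throw new IllegalStateException("Out-of-order sequence: messages were lost");
        }
        lastSequenceByPid.put(producerId, sequence);
        return true;       // exactly the next expected sequence: append it
    }
}
```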
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. They also allow you to commit your consumer offsets in the same transaction along with the data you have processed, thereby enabling end-to-end exactly-once semantics.
The producer does not wait for each individual write to Kafka; instead it brackets its writes with beginTransaction, commitTransaction and abortTransaction (in case of failure). The consumer uses isolation.level, either read_committed or read_uncommitted:
- read_committed: Consumer will always read committed data only.
- read_uncommitted: Read all messages in offset order without waiting for transactions to be committed
If a consumer with isolation.level=read_committed reaches a control message for a transaction that has not completed, it will not deliver any more messages from this partition until the producer commits or aborts the transaction, or a transaction timeout occurs. The transaction timeout is determined by the producer using the configuration transaction.timeout.ms (default 1 minute).
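A minimal sketch of the producer-side transaction API (topic names, broker address and the transactional.id are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "tx-producer-1");    // placeholder

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                // both writes commit or abort together, even across topics/partitions
                producer.send(new ProducerRecord<>("topic-A", "k1", "v1"));
                producer.send(new ProducerRecord<>("topic-B", "k2", "v2"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // a read_committed consumer never sees these records
            }
        }
    }
}
```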
Exactly-Once in Producer & Consumer
In a normal setup we have a separate producer and consumer. The producer has to be idempotent and at the same time manage transactions, so that the consumer can use isolation.level=read_committed to make the whole flow an atomic operation. This guarantees that the producer always stays in sync with the source system. Even if the producer crashes or a transaction is aborted, it stays consistent and publishes a message, or a batch of messages, as a unit exactly once.
The same consumer will either receive a message, or a batch of messages, as a unit exactly once.
With exactly-once semantics, the producer and the consumer together appear as one atomic operation operating as one unit: messages are either published and consumed exactly once, or aborted.
Exactly-Once in Kafka Streams
Kafka Streams consumes messages from topic A, processes them and publishes messages to topic B, and once published uses commit (the commit mostly runs under the covers) to flush all state store data to disk.
Exactly-once in Kafka Streams is a read-process-write pattern that guarantees these operations are treated as one atomic operation. Since Kafka Streams covers the producer, the consumer and transactions all together, it comes with a special parameter, processing.guarantee, which can be exactly_once or at_least_once; this makes life easier because you do not have to handle all the parameters separately.
Kafka Streams atomically updates consumer offsets, local state stores, state store changelog topics and production to output topics all together. If any one of these steps fail, all of the changes are rolled back.
processing.guarantee=exactly_once automatically provides the parameters below, so you do not need to set them explicitly (a minimal Streams example follows the list):
- isolation.level=read_committed
- enable.idempotence=true
- MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5
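A minimal Kafka Streams sketch of the read-process-write pattern with that single switch (application id, broker address and topic names are placeholders; newer client versions also offer exactly_once_v2):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "exactly-once-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // one switch turns on the idempotent producer, transactions and read_committed
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        // read-process-write: consume from topic A, transform, produce to topic B
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("topic-A")
               .mapValues(value -> value.toUpperCase())
               .to("topic-B");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```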
Source: https://stackoverflow.com/questions/58894281/difference-between-idempotence-and-exactly-once-in-kafka-stream