Question
Should we ever invoke processorContext.commit() ourselves in a Processor implementation? I mean calling commit() inside a scheduled Punctuator implementation or inside the process() method.

In which use cases should we do that, and do we need it at all? The question applies both to the Kafka DSL with transform() and to the Processor API.

It seems Kafka Streams handles committing by itself, and invoking processorContext.commit() does not guarantee that the commit will happen immediately.
Answer 1:
It is OK to call commit() -- either from the Processor or from a Punctuator -- that is why this API is offered.

While Kafka Streams commits on a regular (configurable) interval, you can request intermediate commits yourself. One example use case: you usually do cheap computation, but sometimes you do something expensive and want to commit as soon as possible after that operation, instead of waiting for the next commit interval (to reduce the likelihood of a failure between the expensive operation and the next scheduled commit). Another use case: you set the commit interval to MAX_VALUE, which effectively disables regular commits, and decide when to commit based on your business logic.

I guess that calling commit() is not necessary for most use cases, though.
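Both call sites the answer mentions can be sketched with the classic Processor API roughly as follows. This is only an illustrative sketch: the class name and the isExpensive/doExpensiveWork/doCheapWork helpers are hypothetical, and both commit() calls are merely requests that Kafka Streams honors at its next opportunity.

```java
import java.time.Duration;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

// Hypothetical processor: requests an early commit right after an
// expensive operation, and also commits periodically from a Punctuator.
public class ExpensiveStepProcessor extends AbstractProcessor<String, String> {

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // A scheduled Punctuator may also request a commit; like every
        // commit() call, this is a request, not an immediate commit.
        context.schedule(Duration.ofMinutes(5),
                PunctuationType.WALL_CLOCK_TIME,
                timestamp -> context.commit());
    }

    @Override
    public void process(String key, String value) {
        if (isExpensive(value)) {
            doExpensiveWork(value);  // hypothetical costly side effect
            context().commit();      // ask Streams to commit ASAP
        } else {
            doCheapWork(value);      // hypothetical cheap path
        }
    }

    private boolean isExpensive(String v) { return v.startsWith("big:"); }
    private void doExpensiveWork(String v) { /* ... */ }
    private void doCheapWork(String v) { /* ... */ }
}
```

The same pattern applies when using transform() in the DSL, since a Transformer receives the same ProcessorContext.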
Answer 2:
For my use case, I am batching a certain number of records in the processor's process method and writing the batched records to a file once the batch reaches a certain size (let's say 10).

Let's say we write one batch of records to the file and the system crashes before the commit happens (since an explicit commit() is only a request and may not have taken effect yet). The next time the stream starts, the processor reprocesses records from the last committed offset. This means we could be writing some duplicate data to the files. Is there any way to avoid writing duplicate data?
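The batch-then-commit pattern in this follow-up, and the usual mitigation for replays, can be sketched without Kafka. In this hypothetical sketch, a Runnable stands in for context.commit() and a list stands in for the output file; because commits are only requests, a crash between the flush and the actual commit replays records, so the writer deduplicates by record offset to make the file write idempotent. (In a real system that dedup state would itself have to survive restarts, e.g. by encoding offsets in the written output.)

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone sketch of the batch-then-commit pattern from the question,
// with offset-based deduplication to tolerate at-least-once replays.
class BatchingWriter {
    private static final int BATCH_SIZE = 10;
    private final List<String> batch = new ArrayList<>();
    private final Set<Long> writtenOffsets = new HashSet<>(); // dedup state
    private final List<String> sink;        // stands in for the output file
    private final Runnable commitRequest;   // stands in for context.commit()

    BatchingWriter(List<String> sink, Runnable commitRequest) {
        this.sink = sink;
        this.commitRequest = commitRequest;
    }

    void process(long offset, String value) {
        if (!writtenOffsets.add(offset)) {
            return; // seen before the crash/replay: skip the duplicate
        }
        batch.add(value);
        if (batch.size() >= BATCH_SIZE) {
            sink.addAll(batch);   // "write the batch to the file"
            batch.clear();
            commitRequest.run();  // ask for a commit ASAP; not immediate
        }
    }
}
```

With this structure, replaying offsets 5..9 after a simulated restart adds nothing to the sink, while genuinely new offsets are batched as usual.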
Source: https://stackoverflow.com/questions/54075610/kafka-streams-processor-context-commit