Question
A custom processor buffers events in a simple java.util.List in process(). This buffer is not a state store.
Every 30 seconds of WALL_CLOCK_TIME, punctuate() sorts this list and flushes it to the sink. Assume a single-partition source and sink. The EOS (exactly-once) processing guarantee is required.
I know that at any given time either process() or punctuate() is executing, never both.
I am concerned that this buffer is not backed by a changelog topic. Ideally, I believe it should have been a state store to support EOS.
But there is an argument that setting commit.interval.ms to more than 30 seconds, say 40 seconds, will make sure that the events in the buffer are never lost. And since we are using WALL_CLOCK_TIME, punctuate() will always be called every 30 seconds regardless of whether we have events or not.
Is this a valid argument? In which cases would the events in the buffer be lost forever?
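@Override
public void init(ProcessorContext processorContext) {
    super.init(processorContext);
    // Plain in-memory buffer: not a state store, so no changelog topic backs it.
    this.buffer = new ArrayList<>();
    // Ask for flush() to run every 20 seconds of wall-clock time (best effort).
    context().schedule(Duration.ofSeconds(20L), PunctuationType.WALL_CLOCK_TIME, this::flush);
}

// Punctuator callback: forward the buffered events downstream, sorted by id.
void flush(long timestamp) {
    LOG.info("Punctuator invoked.....");
    buffer.stream()
          .sorted(Comparator.comparing(o -> o.getId()))
          .forEach(i -> context().forward(i.getId(), i));
    // Empty the buffer so the same events are not forwarded again on the next run.
    buffer.clear();
}

@Override
public void process(String key, Customer value) {
    LOG.info("Processing {}", key);
    // The event is only held in memory here; nothing is forwarded until flush().
    buffer.add(value);
}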
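The configuration that the argument relies on could be sketched roughly as follows. This is only an illustration using plain java.util.Properties: the keys processing.guarantee and commit.interval.ms are the documented Kafka Streams config names, while the class name EosConfigSketch and the application.id and bootstrap.servers values are hypothetical placeholders.

```java
import java.util.Properties;

class EosConfigSketch {
    // Builds the settings described in the question: EOS plus a 40-second commit interval.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("application.id", "buffering-app");      // hypothetical value
        props.setProperty("bootstrap.servers", "localhost:9092");  // hypothetical value
        props.setProperty("processing.guarantee", "exactly_once");
        // 40 seconds, i.e. longer than the 30-second punctuation interval in the question.
        props.setProperty("commit.interval.ms", "40000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("commit.interval.ms")); // prints 40000
    }
}
```

Note that a commit covers the source offsets of every event process() has already returned for, whether or not the buffer has been flushed yet.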
Answer 1:
I figured out a few arguments against tuning the commit and punctuate intervals and calling this setup foolproof.
From the docs, on WALL_CLOCK_TIME:
This is best effort only as its granularity is limited by how long an iteration of the processing loop takes to complete
It's possible to "miss" a punctuation if: with PunctuationType#WALL_CLOCK_TIME, on GC pause, too short interval
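To make the quoted caveat concrete, here is a minimal, deliberately simplified model of best-effort wall-clock punctuation. It is not Kafka Streams' actual scheduler. It assumes a single thread that can only check for a due punctuation between loop iterations, fires at most once per check, and then skips any due times that have already passed. The class name WallClockPunctuationModel, the method fireTimes and the step durations are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WallClockPunctuationModel {
    // stepDurationsSec: how long each loop iteration (a slow process() call, a GC pause, ...) takes.
    // Returns the times (in seconds) at which the punctuator actually fired.
    static List<Long> fireTimes(long intervalSec, long[] stepDurationsSec) {
        List<Long> fired = new ArrayList<>();
        long now = 0;
        long nextDue = intervalSec;
        for (long d : stepDurationsSec) {
            now += d; // the thread is busy for d seconds; no punctuation can run meanwhile
            if (now >= nextDue) {
                fired.add(now); // fires once, possibly late
                // Skip every due time that has already passed instead of catching up.
                nextDue = (now / intervalSec + 1) * intervalSec;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // A 30-second stall swallows the punctuations due at 40s and 60s into one late run at 65s.
        System.out.println(fireTimes(20, new long[] {10, 10, 15, 30, 15})); // prints [20, 65, 80]
    }
}
```

In this run four punctuations were due within 80 seconds (at 20, 40, 60 and 80), but only three fired, so the flush interval cannot be treated as a hard upper bound.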
Ideal:
punctuate : |-------20s-------|-------20s-------|------20s-------|------20s------|
commit    : |------------30s------------|------------30s-----------|------------30s---
Say process() took too much time (say 18 seconds), so punctuate() was not invoked for its second run at the 40th second because, as the docs mention, the interval was too short.
Now if the application crashes at the 31st second, even with EOS enabled, the source offsets of the events still sitting in the buffer would already have been committed by the commit at the 30th second. At restart those events are not read again, so the buffer contents would be lost.
punctuate : |-------20s-------|------process()---------20s-------|------20s------|
commit    : |------------30s------------|------------30s-------------|------------30s---
Hence it is not a valid argument that tuning the commit and punctuate intervals removes the need for a state store.
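The crash scenario above can be reproduced with a small stand-alone simulation. This is a sketch under stated assumptions, not Kafka Streams code: one event arrives per second, punctuate runs before commit when both fall on the same second, and a commit atomically covers the offsets of all events consumed so far plus everything forwarded so far. Under those assumptions, the events lost on a crash are exactly those that were still in the buffer at the last commit. The class name BufferLossSimulation and the method lostOffsets are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BufferLossSimulation {
    // Returns the offsets of events that are lost if the application crashes at crashAtSec.
    static List<Integer> lostOffsets(int punctuateEverySec, int commitEverySec, int crashAtSec) {
        List<Integer> buffer = new ArrayList<>();               // in-memory only, no changelog
        List<Integer> bufferedAtLastCommit = new ArrayList<>();
        int nextOffset = 0;
        for (int t = 1; t < crashAtSec; t++) {
            buffer.add(nextOffset++);                           // process(): one event per second
            if (t % punctuateEverySec == 0) {
                buffer.clear();                                 // punctuate(): forward and empty the buffer
            }
            if (t % commitEverySec == 0) {
                // Offsets of everything consumed so far are now committed,
                // including events that are still only in the buffer.
                bufferedAtLastCommit = new ArrayList<>(buffer);
            }
        }
        // After a restart, consumption resumes from the committed offset, so these are never re-read.
        return bufferedAtLastCommit;
    }

    public static void main(String[] args) {
        System.out.println(lostOffsets(20, 30, 31).size()); // prints 10 (offsets 20..29)
        System.out.println(lostOffsets(30, 40, 41).size()); // prints 10 (offsets 30..39)
    }
}
```

The second call uses the 30-second punctuation and 40-second commit from the question and still loses events, even though no punctuation was missed in this model.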
Source: https://stackoverflow.com/questions/62666790/can-i-rely-on-a-in-memory-java-collection-in-kafka-stream-for-buffering-events-b