Can I rely on an in-memory Java collection in Kafka Streams for buffering events by fine-tuning the punctuate and commit intervals?

Posted by 风格不统一 on 2020-11-29 11:11:46

Question


I have a custom processor that buffers events in a plain java.util.List in process(); this buffer is not a state store.

Every 30 seconds of WALL_CLOCK_TIME, punctuate() sorts this list and flushes it to the sink. Assume a single-partition source and sink. An EOS (exactly-once semantics) processing guarantee is required.

I know that at any given time either process() gets executed or punctuate() gets executed.

I am concerned that this buffer is not backed by a changelog topic. Ideally, I believe it should have been a state store in order to support EOS.

But there is an argument that setting commit.interval.ms to more than 30 seconds, say 40 seconds, will make sure that the events in the buffer are never lost. Also, since we are using WALL_CLOCK_TIME, punctuate() will always be called every 30 seconds regardless of whether we have events or not.
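For concreteness, the configuration being proposed could look roughly like the sketch below. It uses only java.util.Properties with the literal Kafka Streams config keys, so it runs without Kafka on the classpath. The application id and broker address are hypothetical placeholders, and "exactly_once" is the guarantee name as of the Kafka versions current when this was asked. Note that with EOS enabled the default commit.interval.ms is 100 ms, so a 40-second interval is a deliberate override.

```java
import java.util.Properties;

public class EosConfigSketch {

    // Builds the settings described in the argument: EOS plus a 40 s commit interval.
    static Properties proposedConfig() {
        Properties props = new Properties();
        props.setProperty("application.id", "buffering-app");       // hypothetical name
        props.setProperty("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.setProperty("processing.guarantee", "exactly_once");  // EOS
        props.setProperty("commit.interval.ms", "40000");           // 40 s, longer than the punctuate interval
        return props;
    }

    public static void main(String[] args) {
        proposedConfig().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These settings only control when offsets are committed; they say nothing about whether the in-memory buffer has been flushed by that time.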

Is this a valid argument? What are the cases here that will make the events in the buffer lost forever?

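@Override
public void init(ProcessorContext processorContext) {
    super.init(processorContext);
    // Plain in-memory buffer: not a state store, so no changelog topic backs it.
    this.buffer = new ArrayList<>();
    // Best-effort wall-clock punctuation every 20 seconds.
    context().schedule(Duration.ofSeconds(20L), PunctuationType.WALL_CLOCK_TIME, this::flush);
}

void flush(long timestamp) {
    LOG.info("Punctuator invoked.....");
    // Forward the buffered events downstream in id order.
    buffer.stream()
          .sorted(Comparator.comparing(Customer::getId))
          .forEach(i -> context().forward(i.getId(), i));
    // Empty the buffer so the same events are not forwarded again on the next run.
    buffer.clear();
}

@Override
public void process(String key, Customer value) {
    LOG.info("Processing {}", key);
    // Hold the event in memory until the next punctuation.
    buffer.add(value);
}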

Answer 1:


I worked out a few arguments against calling this setup foolproof just because the commit and punctuate intervals have been tuned.

From the docs, on WALL_CLOCK_TIME:

This is best effort only as its granularity is limited by how long an iteration of the processing loop takes to complete

It's possible to "miss" a punctuation if: with PunctuationType#WALL_CLOCK_TIME, on GC pause, too short interval

Ideal:

punctuate : |-------20s-------|-------20s-------|------20s-------|------20s------|

commit    : |------------30s------------|------------30s-----------|------------30s---

Say process() took too long (say 18 seconds), so punctuate() was not invoked for its second run at the 40th second; as the docs mention, this can happen when the interval is too short.

Now if the application crashes at the 31st second, then even with EOS enabled, the source offsets of the events still sitting in the buffer would already have been committed at the 30th second. On restart those events are not read again, so the contents of the buffer are lost.

punctuate : |-------20s-------|------process()---------20s-------|------20s------|

commit    : |------------30s------------|------------30s-------------|------------30s---
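The crash scenario above can be checked with a small simulation. This is only a sketch under simplifying assumptions, not Kafka Streams code: time is counted in whole seconds, one event arrives per second, an event arriving in the same second as a punctuation is flushed by it, and a commit covers the offsets of every event processed so far. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferLossSketch {

    // Returns the arrival times of events that cannot be recovered after a crash at crashTime.
    static List<Integer> lostOnCrash(List<Integer> arrivals, List<Integer> punctuations,
                                     List<Integer> commits, int crashTime) {
        // Find the last commit that completed before the crash.
        int lastCommit = -1;
        for (int c : commits) {
            if (c <= crashTime && c > lastCommit) {
                lastCommit = c;
            }
        }
        List<Integer> lost = new ArrayList<>();
        for (int t : arrivals) {
            if (t > lastCommit) {
                continue; // offset not committed yet: the event is read again after restart
            }
            // The event is safe only if a punctuation flushed it before that commit.
            boolean flushed = false;
            for (int p : punctuations) {
                if (p >= t && p <= lastCommit) {
                    flushed = true;
                    break;
                }
            }
            if (!flushed) {
                lost.add(t); // offset committed, but the event only ever lived in the buffer
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        List<Integer> arrivals = new ArrayList<>();
        for (int t = 1; t <= 31; t++) {
            arrivals.add(t);
        }
        // Punctuation ran at 20 s, commit ran at 30 s, crash at 31 s.
        System.out.println(lostOnCrash(arrivals, Arrays.asList(20), Arrays.asList(30), 31));
        // prints [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
    }
}
```

In this model the events that arrived between the punctuation at 20 s and the commit at 30 s are the ones lost: their offsets were committed while they existed only in memory.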

Hence it is not a valid argument that tuning the commit and punctuate intervals removes the need for a state store.



Source: https://stackoverflow.com/questions/62666790/can-i-rely-on-a-in-memory-java-collection-in-kafka-stream-for-buffering-events-b
