Question
Using KSQL and performing a left outer join, I sometimes see the result of my join emitted more than once.
In other words, the same join result is emitted more than once. I am not talking about one version of the join with a null value on the right side and another version without the null value. Literally the same record that results from the join is emitted more than once.
I wonder if that is expected behaviour.
Answer 1:
The general answer is yes. By default, Kafka is an at-least-once system. More specifically, a few scenarios can result in duplication:
- Consumers only periodically checkpoint their positions. A consumer crash can result in duplicate processing of some range of records.
- Producers have client-side timeouts. This means the producer may think a request timed out and re-transmit it, while on the broker side it actually succeeded.
- If you mirror data between Kafka clusters, that's usually done with a producer + consumer pair of some sort, which can lead to more duplication.
Are you seeing any such crashes/timeouts in your logs?
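The first scenario above can be illustrated with a small simulation. This is only a sketch in plain Python, not Kafka code: the `consume` function and its `commit_every` and `crash_after` parameters are hypothetical names made up for the illustration, and a real consumer commits on a timer rather than on a record count.

```python
def consume(records, commit_every, crash_after):
    """Simulate a consumer that checkpoints its offset every `commit_every`
    records and crashes exactly once, after processing `crash_after` records."""
    processed = []   # everything the downstream side observes
    committed = 0    # last checkpointed offset
    offset = 0       # current read position
    crashed = False
    while offset < len(records):
        processed.append(records[offset])
        offset += 1
        if offset % commit_every == 0:
            committed = offset      # periodic checkpoint
        if not crashed and offset == crash_after:
            crashed = True
            offset = committed      # restart from the last checkpoint
    return processed


out = consume(["a", "b", "c", "d", "e"], commit_every=3, crash_after=4)
print(out)  # ['a', 'b', 'c', 'd', 'd', 'e']
```

Record "d" was processed after the last checkpoint (offset 3) but before the crash, so it is processed again after the restart. Nothing is lost, but the downstream side sees it twice.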
There are a few Kafka features you could try using to reduce the likelihood of this happening to you:
- set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs); incurs some overhead
- use transactions when producing; incurs overhead and adds latency
- set transactional.id on the producer in case you fail over across machines; gets complicated to manage at scale
- set isolation.level to read_committed on the consumer; adds latency (needs to be done in combination with producer transactions, the second item above)
- shorten auto.commit.interval.ms on the consumer; this just reduces the window of duplication and doesn't really solve anything, and it incurs overhead at really low values
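As a rough sketch, the settings listed above could be collected into client configuration like this. The keys are the Kafka configuration names mentioned in the list plus enable.auto.commit; every value (broker address, group id, transactional id, interval) is a placeholder assumption. Dictionaries of this shape are what dict-configured clients such as confluent-kafka accept, but check the exact form against the client you actually use.

```python
# Producer side: idempotence plus a stable transactional id.
producer_conf = {
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "enable.idempotence": True,              # broker de-duplicates producer retries
    "transactional.id": "join-writer-1",     # hypothetical id; must stay stable across failover
}

# Consumer side: only read committed transactional data,
# and commit offsets more often than the 5000 ms default.
consumer_conf = {
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "join-reader",               # hypothetical consumer group
    "isolation.level": "read_committed",     # only useful if producers use transactions
    "enable.auto.commit": True,
    "auto.commit.interval.ms": 1000,         # smaller window of duplication, more overhead
}
```

Setting transactional.id alone does not make writes transactional; the producer still has to initialize, begin, and commit transactions through its client API.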
Source: https://stackoverflow.com/questions/57895856/ksql-table-table-left-outer-join-emit-same-join-result-more-than-once