Kafka compaction for de-duplication

一向 2021-01-15 05:13

I'm trying to understand how Kafka compaction works and have the following question: Does Kafka guarantee uniqueness of keys for messages stored in a topic with enabled comp

2 Answers
  • 2021-01-15 05:21

    Short answer is no.

    Kafka doesn't guarantee uniqueness of keys for messages stored in a topic with compaction enabled.

    In Kafka you have two types of cleanup.policy:

    • delete - Messages are no longer available after a configured time. Several properties control this: log.retention.hours, log.retention.minutes, log.retention.ms. By default log.retention.hours is set to 168, which means that messages older than 7 days will be deleted.
    • compact - For each key, at least one message will be available. In some situations it will be exactly one, but in most cases there will be more. The compaction process runs periodically in the background. It copies log segments, removing duplicates and leaving only the last value for each key.
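    To illustrate why duplicates survive under the compact policy, here is a toy model in Python. It is only a sketch, not how the broker is implemented: the `compact` function and the `active_start` parameter are hypothetical names, and the model assumes that everything before `active_start` has already been cleaned while the rest of the log has not.

```python
from typing import Dict, List, Optional, Tuple

# (offset, key, value); hypothetical record shape for this sketch
Record = Tuple[int, str, Optional[str]]


def compact(log: List[Record], active_start: int) -> List[Record]:
    """Keep only the last record per key among offsets < active_start.

    Records at or after active_start model the part of the log that the
    cleaner has not touched yet, so they are returned unchanged.
    """
    closed = [r for r in log if r[0] < active_start]
    untouched = [r for r in log if r[0] >= active_start]

    # Remember the highest offset seen for each key in the cleaned part.
    latest_offset: Dict[str, int] = {}
    for offset, key, _ in closed:
        latest_offset[key] = offset

    kept = [r for r in closed if latest_offset[r[1]] == r[0]]
    return kept + untouched


log = [(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c"), (3, "k1", "d"), (4, "k2", "e")]
compacted = compact(log, active_start=3)
keys = [key for _, key, _ in compacted]
print(compacted)
print(keys.count("k1"))  # k1 still appears more than once
```

    Offsets 0 to 2 are reduced to one record per key, but the records at offsets 3 and 4 are kept as-is, so a reader of the whole log still sees both keys twice.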

    If you want to read only one value for each key, you have to use the KTable<K,V> abstraction from Kafka Streams.
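    The idea behind reading one value per key can be sketched without Kafka Streams at all: the reader folds the records into a map, letting later records overwrite earlier ones. The snippet below is a minimal Python illustration of that fold, not the Kafka Streams API. The `materialize` function is a hypothetical name, and treating a `None` value as a delete marker is an assumption of this sketch.

```python
from typing import Dict, Iterable, Optional, Tuple


def materialize(records: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Fold (key, value) records, in log order, into a latest-value table."""
    table: Dict[str, str] = {}
    for key, value in records:
        if value is None:
            # Assumed convention for this sketch: None marks a deleted key.
            table.pop(key, None)
        else:
            table[key] = value
    return table


records = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k2", None), ("k3", "x")]
table = materialize(records)
print(table)
```

    Because the fold always keeps the newest value, it gives the same result whether or not the log has been compacted yet.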

    Related question regarding latest value for key and compaction: Kafka only subscribe to latest message?

  • 2021-01-15 05:44

    Looking at the 4 guarantees of Kafka compaction, number 4 states:

    Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic's delete.retention.ms setting (the default is 24 hours). In other words: since the removal of delete markers happens concurrently with reads, it is possible for a consumer to miss delete markers if it lags by more than delete.retention.ms.

    So you will have more than one value for a key if the head of the topic has not yet been covered by the delete.retention.ms policy.

    As I understand it, if you set a 24h retention policy (delete.retention.ms=86400000), each key will have a single value among the messages that are older than 24 hours. That is the "at least" part: it is not the only value, because many other messages for the same key may have arrived during the last 24 hours.

    So it is guaranteed that you'll see at least one value per key, but not just the last one, because retention has not acted on recent messages.

    Edit: as cricket's comment points out, even if you set a delete retention of 1 day, log.roll.ms is what defines when a log segment is closed, based on message timestamps. Since the last (open) segment is never compacted, it is the second factor that prevents you from having just the last value for a given key. If your topic starts at T0, then messages after T0+log.roll.ms will be in the open log segment and thus not compacted.
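    The effect of time-based segment rolling can be sketched with a simplified model in Python. This is an assumption-laden illustration rather than broker behavior: `split_into_segments` is a hypothetical helper, it rolls purely on the gap between a record's timestamp and the first timestamp in the current segment, and it ignores size-based rolling.

```python
from typing import List


def split_into_segments(timestamps_ms: List[int], roll_ms: int) -> List[List[int]]:
    """Group record timestamps into segments, rolling after roll_ms."""
    segments: List[List[int]] = []
    for ts in timestamps_ms:
        # Start a new segment when the current one is older than roll_ms.
        if not segments or ts - segments[-1][0] >= roll_ms:
            segments.append([ts])
        else:
            segments[-1].append(ts)
    return segments


HOUR = 3_600_000
# Timestamps relative to T0, the time of the first record in the topic.
timestamps = [0, 1 * HOUR, 5 * HOUR, 7 * HOUR, 8 * HOUR]
segments = split_into_segments(timestamps, roll_ms=6 * HOUR)
closed, active = segments[:-1], segments[-1]
print(closed)  # eligible for compaction in this model
print(active)  # still open, so never compacted
```

    In this model the records written at 7h and 8h sit in the open segment, so any keys they repeat stay duplicated until that segment is rolled.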
