Kafka compaction for de-duplication

一向 2021-01-15 05:13

I'm trying to understand how Kafka compaction works and have the following question: Does Kafka guarantee uniqueness of keys for messages stored in a topic with enabled comp

2 Answers
  •  情话喂你
    2021-01-15 05:44

    Looking at the 4 guarantees of Kafka compaction, number 4 states:

    Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic's delete.retention.ms setting (the default is 24 hours). In other words: since the removal of delete markers happens concurrently with reads, it is possible for a consumer to miss delete markers if it lags by more than delete.retention.ms.

    So you can still see more than one value for a key as long as the head of the topic has not yet been cleaned up under the delete.retention.ms policy.

    As I understand it, if you set a 24h retention policy (delete.retention.ms=86400000), you'll have a single value per key for all messages older than 24 hours. That is the "at least" part of the guarantee, but not the whole picture, since many other messages for the same key may have arrived during the last 24 hours.

    So you are guaranteed to see at least the latest value for a key, but not only the latest, because the policy has not yet acted on recent messages.
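
    Since a consumer may see several values for the same key, de-duplication has to be finished on the reading side. A minimal sketch of that fold, using plain Python lists in place of a real consumer (the `Record` type, the function `latest_by_key`, and the sample log are hypothetical, and a `None` value stands in for a delete marker):

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, value); a value of None models a delete marker (tombstone).
Record = Tuple[str, Optional[str]]


def latest_by_key(records: Iterable[Record]) -> Dict[str, str]:
    """Fold records read from the start of the log into one value per key."""
    state: Dict[str, str] = {}
    for key, value in records:
        if value is None:
            # Delete marker: forget the key if we have seen it.
            state.pop(key, None)
        else:
            # A later record for the same key overwrites the earlier one.
            state[key] = value
    return state


# Hypothetical log contents: k1 appears three times, k2 is written then deleted.
log = [("k1", "a"), ("k2", "x"), ("k1", "b"), ("k2", None), ("k1", "c")]
print(latest_by_key(log))
```

    The fold gives the same result whether or not the broker has already compacted some of the older records, which is why it pairs well with the "at least the final state" guarantee.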

    Edit: As cricket's comment states, even if you set a delete retention of 1 day, log.roll.ms is what defines when a log segment is closed, based on the message timestamps. Since the last (open) segment is never considered for compaction, this is a second factor that prevents you from having just the last value for a given key. If your topic starts at T0, then messages after T0+log.roll.ms will be in the open log segment and thus not compacted.
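
    To illustrate why the open segment leaves duplicates behind, here is a toy model only, not Kafka's actual log cleaner: it assumes compaction keeps the newest record per key within the closed segments and never touches the last (open) segment. The names `compact_closed_segments` and `segments` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (offset, key, value); offsets increase across segments.
Record = Tuple[int, str, Optional[str]]


def compact_closed_segments(segments: List[List[Record]]) -> List[List[Record]]:
    """Toy compaction: dedupe keys in closed segments, skip the open one."""
    if not segments:
        return []
    *closed, active = segments

    # Highest offset seen for each key within the closed segments only.
    last_offset: Dict[str, int] = {}
    for segment in closed:
        for offset, key, _value in segment:
            last_offset[key] = offset

    # Keep a record only if it is the newest one for its key.
    cleaned = [
        [rec for rec in segment if last_offset[rec[1]] == rec[0]]
        for segment in closed
    ]
    # The open segment is returned exactly as it was written.
    return cleaned + [active]


segments = [
    [(0, "k1", "a"), (1, "k2", "x"), (2, "k1", "b")],  # closed segment
    [(3, "k1", "c"), (4, "k2", "y")],                  # open segment
]
result = compact_closed_segments(segments)
k1_values = [value for seg in result for (_o, key, value) in seg if key == "k1"]
print(k1_values)
```

    In this model k1 still shows up twice after compaction, once in the cleaned segment and once in the open one, which matches the point above that the topic itself does not guarantee a single record per key.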
