I have a topic with the following description:
Topic:test-topic PartitionCount:1 ReplicationFactor:1 Configs:min.cleanable.dirty.ratio=0.01,min.compac
As Natalia said, the log.segment.bytes
config determines when the log segment file rolls by size.
If you don't want to change log.segment.bytes,
or you don't want to wait until the log segment reaches its full size, you can use the segment.ms topic config (log.roll.ms at the broker level)
to roll the log segment file by time, so that the closed segment becomes eligible for compaction.
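The two roll conditions (by size and by time) can be pictured with a small model. This is only a sketch: the function name, its parameters, and the one-hour limit are made up for illustration, and the real broker considers more than these two limits.

```python
def should_roll(segment_bytes, segment_age_ms, max_bytes, max_age_ms):
    """Simplified model: roll the active segment when either limit is reached."""
    return segment_bytes >= max_bytes or segment_age_ms >= max_age_ms


ONE_GIB = 1073741824          # a size limit of 1 GiB
ONE_HOUR_MS = 60 * 60 * 1000  # a hypothetical time limit of one hour

# A 10 MiB segment is far from the size limit, so only the time limit can roll it.
small = 10 * 1024 * 1024
rolls_when_young = should_roll(small, 5 * 60 * 1000, ONE_GIB, ONE_HOUR_MS)
rolls_when_old = should_roll(small, 2 * ONE_HOUR_MS, ONE_GIB, ONE_HOUR_MS)
```

With a low-traffic topic the size limit may never be reached, which is why a time-based roll is the usual way to get a closed segment that the cleaner can work on.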
"Kafka should be deploying log compaction under these conditions and pruning all but the newest message with a given key."
This seems to be one of the major misconceptions when it comes to Kafka's log compaction. In the Kafka documentation on Log Compaction it is noted that
"Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition."
There is no guarantee that a compacted topic contains no duplicate keys. As mentioned in the other answers, it is crucial to take a closer look at how your data is physically stored in the logs/segments. In the LogCleaner class you can find a good explanation of the difference between clean and dirty segments:
"A message with key K and offset O is obsolete if there exists a message with key K and offset O' such that O < O'. Each log can be thought of being split into two sections of segments: a "clean" section which has previously been cleaned followed by a "dirty" section that has not yet been cleaned. The dirty section is further divided into the "cleanable" section followed by an "uncleanable" section. The uncleanable section is excluded from cleaning. The active log segment is always uncleanable. If there is a compaction lag time set, segments whose largest message timestamp is within the compaction lag time of the cleaning operation are also uncleanable."
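To see why duplicate keys can survive, here is a toy model of that rule. It is a sketch under simplifying assumptions, not Kafka's implementation: every closed segment is treated as cleanable, the last segment is the active one, and the `compact` function and record layout are invented for illustration.

```python
def compact(segments):
    """segments: list of segments, each a list of (offset, key, value).

    The last segment is treated as the active one and is never cleaned.
    """
    *cleanable, active = segments

    # Highest offset seen per key, looking only at the cleanable segments.
    latest = {}
    for segment in cleanable:
        for offset, key, _value in segment:
            latest[key] = offset

    # Keep a record only if it is the newest one for its key in that section.
    cleaned = [
        [rec for rec in segment if latest[rec[1]] == rec[0]]
        for segment in cleanable
    ]
    return cleaned + [active]


log = [
    [(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c")],  # closed segment
    [(3, "k2", "d"), (4, "k1", "e")],                  # active segment
]
result = compact(log)
# result == [[(1, "k2", "b"), (2, "k1", "c")], [(3, "k2", "d"), (4, "k1", "e")]]
```

Offset 0 is removed because offset 2 carries the same key, but k1 and k2 each still appear twice in the log, once in the cleaned section and once in the active segment. With a single segment nothing is cleaned at all.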
The guarantees that log compaction provides are given in the Kafka documentation, together with the most important configurations to control the cleaner:
1) To activate the compaction cleanup policy, cleanup.policy=compact must be set on the topic.
2) A consumer sees all tombstones as long as it reaches the head of the log within a period shorter than the topic config delete.retention.ms (the default is 24 hours).
3) The number of cleaner threads is configurable through the log.cleaner.threads config.
4) The cleaner thread then chooses the log with the highest dirty ratio: dirty ratio = the number of bytes in the head / total number of bytes in the log (tail + head).
5) The topic config min.compaction.lag.ms guarantees a minimum period that must pass before a message can be compacted.
6) To set this delay as a broker-wide default, use log.cleaner.min.compaction.lag.ms. Records won't get compacted until after this period, which gives consumers time to get every record.
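The dirty ratio from point 4 is simple arithmetic, sketched below. The function names and the sample byte counts are hypothetical; the 0.5 threshold is the default of min.cleanable.dirty.ratio, and the strict comparison is a simplification of how the cleaner decides.

```python
def dirty_ratio(tail_bytes, head_bytes):
    """Bytes in the head (dirty) divided by total bytes in the log (tail + head)."""
    total = tail_bytes + head_bytes
    return head_bytes / total if total else 0.0


def pick_log_to_clean(logs, min_cleanable_dirty_ratio=0.5):
    """logs maps a log name to (tail_bytes, head_bytes).

    Returns the name of the log with the highest dirty ratio among those
    above the threshold, or None if no log qualifies.
    """
    ratios = {name: dirty_ratio(*sizes) for name, sizes in logs.items()}
    eligible = {n: r for n, r in ratios.items() if r > min_cleanable_dirty_ratio}
    return max(eligible, key=eligible.get) if eligible else None


logs = {
    "orders-0": (900, 100),  # ratio 0.1
    "users-0": (200, 800),   # ratio 0.8
}
chosen_default = pick_log_to_clean(logs)                      # "users-0"
chosen_none = pick_log_to_clean({"orders-0": (900, 100)})     # None at 0.5
chosen_low = pick_log_to_clean({"orders-0": (900, 100)}, 0.01)  # "orders-0"
```

Lowering the threshold to 0.01, as in the topic description above, makes a log eligible almost as soon as it has any dirty bytes, but only bytes in closed segments can be cleaned.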
For compaction to run, you need at least 2 segment files (one closed and one active).
According to your configuration:
log.segment.bytes=1073741824
log.segment.bytes=536870912
(please check why the same property is set twice with different values).
You need one full 512 MB segment file so that Kafka can run compaction on it. Please check that you have at least 2 segment files for the topic-partition you want to be compacted.
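One way to check this is to count the .log files in the partition's directory on the broker. The sketch below assumes the usual layout where each segment is a file named after its zero-padded base offset; the `closed_segments` helper is hypothetical, and the demo uses a temporary directory as a stand-in for the real partition directory under your log.dirs.

```python
import tempfile
from pathlib import Path


def closed_segments(partition_dir):
    """Return the .log files of all segments except the active (newest) one."""
    # File names are zero-padded base offsets, so sorting by name sorts by offset.
    logs = sorted(Path(partition_dir).glob("*.log"))
    return logs[:-1]


# Demo against a throwaway directory standing in for a partition directory.
with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "test-topic-0"
    part.mkdir()

    (part / "00000000000000000000.log").touch()
    one_segment = [p.name for p in closed_segments(part)]

    (part / "00000000000000001500.log").touch()
    two_segments = [p.name for p in closed_segments(part)]
```

If the helper returns an empty list, the partition has only its active segment and the cleaner has nothing it is allowed to compact yet.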