Question
I found these two questions: here and here, but I still don't quite understand, and I still get (unexpected?) behaviour.
I am trying to log-compact a Kafka topic using this configuration:
kafka-topics.sh --bootstrap-server localhost:9092 --create --partitions 1 --replication-factor 1 --topic test1 --config "cleanup.policy=compact" --config "delete.retention.ms=1000" --config "segment.ms=1000" --config "min.cleanable.dirty.ratio=0.01" --config "min.compaction.lag.ms=500"
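To double-check that the overrides actually took effect, something like this should show them (assuming the same broker address):
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test1 --describe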
Then I send these messages, each at least 1 second apart (one way to produce them is sketched after the list):
A: 3
A: 4
A: 5
B: 10
B: 20
B: 30
B: 40
A: 6
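For context, keyed messages like these can be produced with the console producer, roughly like this (the separator is arbitrary, it just has to match key.separator):
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test1 --property parse.key=true --property "key.separator=:"
>A:3
>A:4
(and so on, one line per message, about a second apart)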
What I expect is that after a few seconds (1000 ms, as configured?), when I run kafka-console-consumer.sh --bootstrap-server localhost:9092 --property print.key=true --topic test1 --from-beginning, I should get:
A: 6
B: 40
Instead, I got:
A: 5
B: 40
A: 6
If I publish another message, B: 50, and run the consumer, I get:
B: 40
A: 6
B: 50
instead of the expected:
A: 6
B: 50
- Actually, how should log compaction be configured?
- From the Kafka documentation: "Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition." Does this mean I can only use log compaction on a topic with a single partition?
Answer 1:
Basically, you already provided the answer yourself. As stated in the Kafka documentation, "log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition". So it is not guaranteed that you will always have exactly one message per key.
If I understand log compaction correctly, it is not meant for use cases like the one you describe in this very valid question. Rather, it is meant to eventually reach the stage where only one message per key is present in the topic.
Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.
A compacted topic is the right choice if you plan to keep only the latest state for each key, with the goal of processing as few old states as possible (compared to what you would have with a non-compacted topic, depending on time/size-based retention). Use cases for log compaction are, as far as I have learned, rather about keeping the latest address, mobile number, value in a database, etc.: values which do not change every moment and where you usually have many keys.
From a technical perspective, I guess the following happened in your case.
When it comes to compaction, the log is viewed as split into two portions:
- Clean: Messages that have been compacted before. This section contains only one value for each key, which is the latest value at the time of the previous compaction.
- Dirty: Messages that were written after the last compaction (the sketch after this list shows where the broker records this boundary).
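If you are curious where the broker keeps track of that clean/dirty boundary, the log cleaner persists a per-partition checkpoint in its log directory; assuming the default log.dirs=/tmp/kafka-logs:
cat /tmp/kafka-logs/cleaner-offset-checkpoint
# a version line and an entry count, then one "topic partition first-dirty-offset" line per partition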
After producing the message B: 40 (A: 5 was already produced), the clean part of the log is empty and the dirty/active part contains A: 5 and B: 40. The message A: 6 is not yet part of the log at all. Producing the new message A: 6 will start compaction on the dirty part of the log (because your ratio is very low), but excluding the new message itself. Since the dirty part contains only one value per key, there is nothing to clean, so the new message will just be appended to the topic and is now in the dirty part of the log. The same happens with what you observed when producing B: 50.
In addition, compaction will never happen on your active segment. So, even though you set segment.ms to just 1000 ms, it will not produce a new segment, as no new data is incoming after producing A: 6 or B: 50.
To solve your issue and observe the expected behaviour, you need to produce another message, C: 1, after producing A: 6 or B: 50. That way the cleaner can compare the clean and dirty parts of the log again and will remove A: 5 or B: 40.
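Concretely, that could look roughly like this (same tooling as above; the key C is only there as a trigger):
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test1 --property parse.key=true --property "key.separator=:"
>C:1
# wait a bit longer than segment.ms plus min.compaction.lag.ms, then re-read:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --property print.key=true --topic test1 --from-beginning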
In the meantime, have a look at how the segments behave in Kafka's log directory.
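Assuming the default log directory /tmp/kafka-logs, the segment files of partition 0 and their contents can be inspected like this:
ls -l /tmp/kafka-logs/test1-0/
# dump the records (including keys) of the first segment; the file name is the segment's base offset
kafka-dump-log.sh --print-data-log --files /tmp/kafka-logs/test1-0/00000000000000000000.log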
From my perspective, your configuration for log compaction is totally fine! It is just not the right use case in which to observe the expected behaviour. For a production use case, however, be aware that your current configuration tries to start compaction quite frequently. This can become quite I/O intensive depending on the volume of your data. There is a reason the default ratio is set to 0.50 and log.roll.hours is typically set to 24 hours. Also, you usually want to ensure that consumers have the chance to read all the data before it gets compacted.
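Just as an illustration of less aggressive settings, the overrides could later be relaxed with a dynamic config change (the values below are arbitrary examples, not tuned recommendations):
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test1 --alter \
  --add-config min.cleanable.dirty.ratio=0.5,segment.ms=86400000,delete.retention.ms=86400000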
Source: https://stackoverflow.com/questions/61430509/kafka-log-compaction-always-shows-two-last-records-of-same-key