What are internal topics used in Kafka?

Question

We are using the Kafka Streams API for aggregation, and the topology includes a group-by. We are also using a state store, which holds the data from the input topic.

What I noticed is:

Kafka internally creates three kinds of topics:

Changelog-<storeid>-<partition>
Repartition-<storeid>-<partition>
<topicname>-<partition>

What I am not able to understand is:

Why does it create a changelog topic when I already have all the data in <topic>-<partition>?
Does the repartition topic contain the data after grouping?
I also see that the sizes of Changelog and <topicname>-<partition> are approximately the same.

What is different about the data that requires it to be saved in a separate file?

Answer 1:

There are several types of internal Kafka topics:

__consumer_offsets is used to store offset commits per topic/partition.
__transaction_state is used to keep state for Kafka producers and consumers using transactional semantics.
_schemas is used by Schema Registry to store all the schemas, metadata and compatibility configuration.
The following three topics are examples of internal topics used by Kafka Streams. The first two are changelogs for the state stores of a join; the third is the changelog for a persistent RocksDB state store:
- {consumer-group}--KSTREAM-JOINOTHER-0000000005-store-changelog
- {consumer-group}--KSTREAM-JOINTHIS-0000000004-store-changelog
- {consumer-group}--incompleteMessageStore-changelog
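As far as I know, Kafka Streams derives these names from the application.id (which is also used as the consumer group id), the store or operator name, and a fixed suffix. A minimal sketch of that pattern, where "my-app" and the helper function names are made up for illustration and are not part of any Kafka API:

```python
def changelog_topic(application_id: str, store_name: str) -> str:
    """Build the name of the changelog topic that backs a state store."""
    return f"{application_id}-{store_name}-changelog"


def repartition_topic(application_id: str, operator_name: str) -> str:
    """Build the name of the repartition topic created for a re-keying step."""
    return f"{application_id}-{operator_name}-repartition"


# "my-app" is a hypothetical application.id.
print(changelog_topic("my-app", "incompleteMessageStore"))
# prints: my-app-incompleteMessageStore-changelog
print(repartition_topic("my-app", "KSTREAM-AGGREGATE-STATE-STORE-0000000003"))
# prints: my-app-KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition
```

This is why changing the application.id makes an application start with fresh internal topics: the old ones no longer match the prefix.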

Some more information here:

What is the use of __consumer_offsets and _schema topics in Kafka?

Answer 2:

'Changelog' and 'repartition' internal Kafka topics are specific to Kafka Streams.

From the Kafka wiki:

Kafka Streams allows for stateful stream processing, i.e. operators that have an internal state. This internal state is managed in so-called state stores. A state store can be ephemeral (lost on failure) or fault-tolerant (restored after the failure). The default implementation used by Kafka Streams DSL is a fault-tolerant state store using 1. an internally created and compacted changelog topic (for fault-tolerance) and 2. one (or multiple) RocksDB instances (for cached key-value lookups). Thus, in case of starting/stopping applications and rewinding/reprocessing, this internal data needs to get managed correctly.

Changelog topics are created when there are stateful operations on the stream, such as joins or aggregations. The aggregation call creates a state store, and for fault tolerance that state store is backed by a Kafka changelog topic.

The aggregation results are stored in this internal topic. The state is recovered from the changelog topic when the application is restarted, provided the application.id has not changed.
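To make the difference from the input topic concrete, here is a small plain-Python model of the idea (it does not use the Kafka Streams API; the records, the count aggregation, and all function names are assumptions for illustration). The input topic holds the raw records, while the changelog holds the value of the store per key after each update, and after compaction only the latest value per key is needed to rebuild the store:

```python
# Hypothetical input records as (key, value) pairs from the input topic.
input_records = [("alice", 1), ("bob", 1), ("alice", 1), ("alice", 1), ("bob", 1)]


def aggregate(records):
    """Count records per key and log every state update to a changelog."""
    store = {}
    changelog = []
    for key, _ in records:
        store[key] = store.get(key, 0) + 1
        changelog.append((key, store[key]))
    return store, changelog


def compact(changelog):
    """Model log compaction: keep only the latest value for each key."""
    latest = {}
    for key, value in changelog:
        latest[key] = value
    return list(latest.items())


def restore(changelog):
    """Rebuild a state store by replaying the changelog from the start."""
    store = {}
    for key, value in changelog:
        store[key] = value
    return store


store, changelog = aggregate(input_records)
restored = restore(compact(changelog))
print(store)     # {'alice': 3, 'bob': 2}
print(restored)  # {'alice': 3, 'bob': 2}
```

In this model the uncompacted changelog has one entry per input record, which would fit the observation that both are about the same size on disk. Real deployments can differ, since record caching reduces the number of updates written and compaction runs in the background.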

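Repartition topics are created when a key-changing operation on the stream is followed by an operation that depends on the key, such as an aggregation or a join. For example, groupBy() with a new key, or groupByKey() after map() or selectKey(), creates a repartition topic; groupByKey() on a stream whose key was not changed does not. Check the JIRA page to learn more about the automatic creation of repartition topics.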
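The reason a repartition step is needed can be sketched in plain Python (again a model, not the Kafka Streams API). Records are written back under the new key, and the partition is chosen from that key, so all records for one key end up in the same partition and can be aggregated by a single task. The orders, the partition count, and the use of CRC32 are assumptions for illustration; Kafka's default partitioner uses a murmur2 hash instead:

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Choose a partition from the key bytes (CRC32 stands in for murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition(records, new_key):
    """Re-key each record and route it to the partition of its new key."""
    partitions = {p: [] for p in range(NUM_PARTITIONS)}
    for record in records:
        key = new_key(record)
        partitions[partition_for(key)].append((key, record))
    return partitions


# Hypothetical orders, originally keyed by order id.
orders = [
    {"order_id": "o1", "customer": "alice", "amount": 10},
    {"order_id": "o2", "customer": "bob", "amount": 5},
    {"order_id": "o3", "customer": "alice", "amount": 7},
    {"order_id": "o4", "customer": "carol", "amount": 3},
]

# Group by customer: every record for a customer lands in one partition.
by_customer = repartition(orders, lambda r: r["customer"])
for partition, records in by_customer.items():
    print(partition, [key for key, _ in records])
```

Note that the repartition topic in this model holds the re-keyed input records, not the aggregated results; those go to the state store and its changelog.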

These two kinds of internal topics enable Kafka Streams to provide fault-tolerant, stateful stream processing.

Does the repartition topic contain the data after grouping? - Yes

The sizes of Changelog and <topicname>-<partition> are approximately the same - Possibly, because the results of all aggregation operations are stored in this topic.

For more details, please check the Kafka wiki page.

Source: https://stackoverflow.com/questions/56080896/what-are-internal-topics-used-in-kafka

Tags

apache-kafka

apache-kafka-streams