Kafka Stateful Stream processor with statestore: Behind the scenes

Question

I am trying to understand stateful stream processors.

As I understand it, this type of stream processor maintains some sort of state using a state store.

I learned that one way to implement a state store is with RocksDB. Assume the following topology, in which only one processor is stateful:

A->B->C, where processor B is stateful, with a local state store and changelog enabled. I am using the low-level API.

Assume the stream processor listens on a single Kafka topic, say topic-1, with 10 partitions.

I observed that when the application is started (2 instances on different physical machines, with num.stream.threads = 5), it creates a directory structure for the state store that looks like this:

0_0, 0_1, 0_2, ..., 0_9 (each machine has five, for a total of 10).

I was going through some online material that said we should create a StoreBuilder and attach it to the topology using addStateStore() instead of creating a state store within a processor.

Like:

topology.addStateStore(storeBuilder,"processorName")

See also: org.apache.kafka.streams.state.Stores

I don't understand the difference between attaching a StoreBuilder to the topology and creating a state store directly within the processor. What is the difference between them?

The second part: for the state store, it creates directories like 0_0, 0_1, etc. Who creates them, and how? Is there some sort of 1:1 mapping between the Kafka topics (that the stream processor listens to) and the number of directories that get created for the state store?

Answer 1:

I don't understand the difference between attaching a StoreBuilder to the topology and creating a state store directly within the processor. What is the difference between them?

In order to let Kafka Streams manage the store for you (fault-tolerance, migration), Kafka Streams needs to be aware of the store. Thus, you give Kafka Streams a StoreBuilder and Kafka Streams creates and manages the store for you.

If you just create a store inside your processor, Kafka Streams is not aware of the store and the store won't be fault-tolerant.
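As a rough illustration of the first approach, here is a minimal sketch of the A->B->C topology with the store registered through a StoreBuilder. It assumes the kafka-streams library (3.x Processor API) is on the classpath; the store name "counts-store", the sink topic "output-topic", the String/Long types, and the counting logic are hypothetical choices made only for this example.

```java
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

class StatefulTopologySketch {

    // Hypothetical store name, used only in this sketch.
    static final String STORE_NAME = "counts-store";

    // Processor B: counts records per key, keeping the counts in the managed store.
    static class CountingProcessor implements Processor<String, String, String, Long> {
        private ProcessorContext<String, Long> context;
        private KeyValueStore<String, Long> store;

        @Override
        public void init(ProcessorContext<String, Long> context) {
            this.context = context;
            // The store is looked up by name; the processor does not construct it.
            this.store = context.getStateStore(STORE_NAME);
        }

        @Override
        public void process(Record<String, String> record) {
            Long previous = store.get(record.key());
            Long updated = (previous == null ? 0L : previous) + 1L;
            store.put(record.key(), updated);
            context.forward(record.withValue(updated));
        }
    }

    static Topology build() {
        // Persistent (RocksDB-backed) key-value store; changelogging is on by default.
        StoreBuilder<KeyValueStore<String, Long>> storeBuilder =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(),
                Serdes.Long());

        ProcessorSupplier<String, String, String, Long> supplier = CountingProcessor::new;

        Topology topology = new Topology();
        topology.addSource("A", new StringDeserializer(), new StringDeserializer(), "topic-1");
        topology.addProcessor("B", supplier, "A");
        // Registering the builder and connecting it to processor "B" is what
        // makes Kafka Streams aware of the store.
        topology.addStateStore(storeBuilder, "B");
        topology.addSink("C", "output-topic", new StringSerializer(), new LongSerializer(), "B");
        return topology;
    }
}
```

Calling StatefulTopologySketch.build().describe() should list the store next to processor B, which is a quick way to check that the store is attached to the topology rather than hidden inside the processor.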

For the state store, it creates directories like 0_0, 0_1, etc. Who creates them, and how? Is there some sort of 1:1 mapping between the Kafka topics (that the stream processor listens to) and the number of directories that get created for the state store?

Yes, there is a mapping. The store is sharded based on the number of input topic partitions. You also get a "task" per partition, and the task directories are named y_z, with y being the sub-topology number and z being the partition number. Your simple topology has only one sub-topology, so all the directories you see have the same 0_ prefix.

Hence, your logical store has 10 physical shards. This sharding allows Kafka Streams to migrate state when the corresponding input topic partition is assigned to a different instance. Overall, you can run up to 10 instances, and each would process one partition and host one shard of your store.
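The naming scheme and the shard count can be reproduced with plain arithmetic. The following is only an illustrative sketch in plain Java, not Kafka Streams code: the class and method names are made up for this example, and the per-instance figure assumes the tasks are spread evenly (the real assignment is decided by Kafka Streams).

```java
import java.util.ArrayList;
import java.util.List;

class TaskDirectorySketch {

    // Task directory name: <sub-topology number>_<partition number>.
    static String taskDirName(int subTopology, int partition) {
        return subTopology + "_" + partition;
    }

    // All task directories of one sub-topology, across the whole application.
    static List<String> allTaskDirs(int subTopology, int partitions) {
        List<String> dirs = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            dirs.add(taskDirName(subTopology, p));
        }
        return dirs;
    }

    // Largest number of tasks (and store shards) on one instance,
    // assuming an even spread of tasks over the instances.
    static int tasksPerInstance(int partitions, int instances) {
        return (int) Math.ceil((double) partitions / instances);
    }

    public static void main(String[] args) {
        // One sub-topology (0) and 10 input partitions, as in the question.
        System.out.println(allTaskDirs(0, 10));
        // prints [0_0, 0_1, 0_2, 0_3, 0_4, 0_5, 0_6, 0_7, 0_8, 0_9]
        System.out.println(tasksPerInstance(10, 2));
        // prints 5
    }
}
```

With 10 instances the same arithmetic gives one task per instance, which matches the upper bound on useful instances mentioned above.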

Source: https://stackoverflow.com/questions/61622414/kafka-stateful-stream-processor-with-statestore-behind-the-scenes

Tags

apache-kafka

apache-kafka-streams