Kafka Streaming tasks and management of Internal state stores

Submitted by 时间秒杀一切 on 2020-12-15 06:07:22

Question


Let's say we have launched two streaming tasks on two different machines (instances) with the following properties:

public final static String applicationID = "StreamsPOC";
public final static String bootstrapServers = "10.21.22.56:9093";    
public final static String topicname = "TestTransaction";
public final static String shipmentTopicName = "TestShipment";
public final static String RECORD_COUNT_STORE_NAME = "ProcessorONEStore";

Using these properties, the stream task's definition looks like this:

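        Map<String, String> changelogConfig = new HashMap<>();
        // Topic-level config for the changelog topic (key spelling corrected from "min.insyc.replicas").
        changelogConfig.put("min.insync.replicas", "1");
        // The line below does not work: the changelog topic is not renamed.
        changelogConfig.put("topic", "myChangedTopicLog");

        StoreBuilder<KeyValueStore<String, Integer>> kvStoreBuilder = Stores.keyValueStoreBuilder(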
                Stores.persistentKeyValueStore(AppConfigs.RECORD_COUNT_STORE_NAME),
                AppSerdes.String(), AppSerdes.Integer()
        ).withLoggingEnabled(changelogConfig);

        kStreamBuilder.addStateStore(kvStoreBuilder);


        KStream<String, String> sourceKafkaStream = kStreamBuilder.stream
                (AppConfigs.topicname, Consumed.with(AppSerdes.String(), AppSerdes.String()));

As I observed, Kafka created a topic under the hood (to back up the internal state store) with the following name: StreamsPOC-ProcessorONEStore-changelog

First question: do both streaming tasks maintain and back up their internal state store to the same topic?

Second question: say Task-1 picks up partition-1 and writes <K1, V1> to its local internal state store, and Task-2 works on partition-2 and also writes <K1, V1> to its own local state store. Is there not a risk of data being overwritten, since both tasks back up their data to the same changelog topic?

Third question: how can I specify a custom name for the changelog topic?

Responses would be highly appreciated!


Answer 1:


First, a note on terminology: the term "task" has a well-defined meaning in Kafka Streams, and as a user you don't create tasks yourself. When your program is executed, Kafka Streams creates tasks, which are "independent units of computation", and executes them for you. I guess that what you mean by "task" is actually a KafkaStreams client (which is called an instance).

If you start multiple instances with the same application.id, they form a consumer group and share the load in a data-parallel manner. For state stores, each instance hosts a shard (sometimes also called a partition) of the store. All instances use the same changelog topic, and the topic has one partition per store shard, so there is a 1:1 mapping from store shard to changelog partition. Furthermore, there is a 1:1 mapping from input topic partitions to tasks and a 1:1 mapping between tasks and store shards. Overall, it is a 1:1:1:1 mapping: for each input topic partition one task is created, each task holds one shard of the state store, and each store shard is backed by one partition of the changelog topic. (The bottom line is that the number of input topic partitions determines how many parallel tasks and store shards you get, and the changelog topic is created with the same number of partitions as your input topic.)
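The 1:1:1:1 mapping can be sketched in plain Java, with no Kafka dependency. This is only an illustrative model: `TaskLayoutSketch`, `Row`, and `layout` are hypothetical names, not Kafka Streams classes. The task id format `<subTopologyId>_<partition>` assumes a single sub-topology numbered 0.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; none of these types are Kafka Streams classes.
final class TaskLayoutSketch {

    /** One row of the mapping: input partition -> task -> store shard -> changelog partition. */
    static final class Row {
        final int inputPartition;
        final String taskId;
        final int storeShard;
        final int changelogPartition;

        Row(int inputPartition, String taskId, int storeShard, int changelogPartition) {
            this.inputPartition = inputPartition;
            this.taskId = taskId;
            this.storeShard = storeShard;
            this.changelogPartition = changelogPartition;
        }

        @Override
        public String toString() {
            return "input partition " + inputPartition
                    + " -> task " + taskId
                    + " -> store shard " + storeShard
                    + " -> changelog partition " + changelogPartition;
        }
    }

    /** Builds one row per input partition; every column shares the same partition number. */
    static List<Row> layout(int inputPartitions) {
        List<Row> rows = new ArrayList<>();
        for (int p = 0; p < inputPartitions; p++) {
            // Assumption: a single sub-topology with id 0, so task ids look like "0_<partition>".
            rows.add(new Row(p, "0_" + p, p, p));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (Row row : layout(2)) {
            System.out.println(row);
        }
    }
}
```

With two input partitions the model yields two tasks, however many instances are running; the instances only decide which of them hosts which task.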

  1. So yes, all instances use the same changelog topic.
  2. Because tasks are isolated via store shards and changelog topic partitions, they won't overwrite each other. However, the idea behind tasks is that each task processes a different (non-overlapping) key-space, so all records with the same <k1,...> should be processed by the same task. There may be exceptions to this rule: if your application does not use non-overlapping key-spaces, the program will still run, and whether the result is correct depends on your business logic requirements.
  3. It seems you already did. Note that you can only customize parts of the changelog topic name, <application.id>-<storeName>-changelog: you can pick the application.id and the storeName, but the overall naming pattern is hard-coded.
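Points 2 and 3 can be illustrated with a small plain-Java sketch. It is a toy model rather than the Kafka Streams implementation: `ChangelogSketch`, `changelogTopic`, `append`, and `latest` are hypothetical names, and the in-memory map only stands in for a partitioned changelog topic. The partition numbers 1 and 2 mirror the question.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a partitioned changelog topic; not a Kafka Streams class.
final class ChangelogSketch {

    /** Applies the fixed naming pattern <application.id>-<storeName>-changelog. */
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // changelog partition -> (key -> latest value written to that partition)
    private final Map<Integer, Map<String, Integer>> partitions = new HashMap<>();

    /** Simulates a task backing up a store update to its own changelog partition. */
    void append(int partition, String key, int value) {
        partitions.computeIfAbsent(partition, p -> new HashMap<>()).put(key, value);
    }

    /** Returns the latest value for a key within one partition, or null if absent. */
    Integer latest(int partition, String key) {
        Map<String, Integer> entries = partitions.get(partition);
        return entries == null ? null : entries.get(key);
    }

    public static void main(String[] args) {
        System.out.println(changelogTopic("StreamsPOC", "ProcessorONEStore"));

        ChangelogSketch changelog = new ChangelogSketch();
        changelog.append(1, "K1", 10); // written by the task that owns partition 1
        changelog.append(2, "K1", 20); // written by the task that owns partition 2

        // Same key, same topic, different partitions: neither write replaces the other.
        System.out.println(changelog.latest(1, "K1"));
        System.out.println(changelog.latest(2, "K1"));
    }
}
```

Both writes use the key K1 and go to the same topic, yet each value survives, because a task only ever writes to and restores from its own changelog partition.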


Source: https://stackoverflow.com/questions/64285276/kafka-streaming-tasks-and-management-of-internal-state-stores
