apache-kafka-streams

How to discover and filter out duplicate records in Kafka Streams

南笙酒味 · Submitted on 2019-12-22 11:28:42
Question: Say you have a topic with a null key and the value {id:1, name:Chris, age:99}. Let's say you want to count the number of people by name. You would do something like below:

    nameStream.groupBy((key, value) -> value.getName())
              .count();

Now let's say duplicate records are valid, and you can tell a record is a duplicate based on its id. For example:

    {id:1, name:Chris, age:99}
    {id:1, name:Chris, age:xx}

should result in a count of one, and

    {id:1, name:Chris, age:99}
    {id:2, name:Chris, age…
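
One hedged approach (not from the excerpt itself) is to deduplicate by re-keying on id before counting, so each id contributes exactly once. A minimal sketch, assuming a Person value type with getId()/getName() accessors, String ids, and a matching personSerde:

    KTable<String, Long> countsByName = nameStream
        // re-key by id so duplicates of the same record share a key
        .selectKey((key, person) -> person.getId())
        .groupByKey(Grouped.with(Serdes.String(), personSerde))
        // keep only the first record seen per id, discarding later duplicates
        .reduce((first, duplicate) -> first)
        // count distinct ids per name
        .groupBy((id, person) -> KeyValue.pair(person.getName(), person),
                 Grouped.with(Serdes.String(), personSerde))
        .count();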

Kafka Streams - How to scale Kafka store generated changelog topics

拥有回忆 · Submitted on 2019-12-22 11:27:23
Question: I have multiple redundant app instances that each want to consume all the events of a topic and store them independently for disk lookup (via RocksDB). For the sake of argument, let's assume these redundant consumers serve stateless HTTP requests, so the load is not shared through Kafka; rather, Kafka is used to replicate data from a producer into each instance's local store. Looking at the topics generated, each consuming app created three extra topics: {topicname}STATE…
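
The question is cut off here, but one hedged way to avoid per-instance changelog topics in this pattern (my assumption, not from the excerpt) is a global store: a GlobalKTable is fully replicated to every instance and restores from the source topic itself, so Kafka Streams creates no changelog topic for it.

    // A sketch: every instance gets a full local RocksDB copy of the topic,
    // restored from the source topic directly (no changelog topic is created).
    StreamsBuilder builder = new StreamsBuilder();
    GlobalKTable<String, String> localCopy = builder.globalTable(
        "source-topic",
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("local-copy-store")
            .withKeySerde(Serdes.String())
            .withValueSerde(Serdes.String()));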

Why don't I see any output from the Kafka Streams reduce method?

与世无争的帅哥 · Submitted on 2019-12-21 17:06:06
Question: Given the following code:

    KStream<String, Custom> stream = builder.stream(Serdes.String(), customSerde, "test_in");
    stream.groupByKey(Serdes.String(), customSerde)
          .reduce(new CustomReducer(), "reduction_state")
          .print(Serdes.String(), customSerde);

I have a println statement inside the apply method of the Reducer, which prints successfully when I expect the reduction to take place. However, the final print statement shown above displays nothing. Likewise, if I use a to method rather than…
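
The excerpt stops before the answer, but the usual cause (my assumption) is the KTable record cache: reduce results are buffered and only forwarded downstream on commit or cache eviction, so the downstream print can appear silent. Disabling the cache makes every update flow through immediately:

    // A sketch: turn off record caching so each reduce update is forwarded
    // (and printed) right away instead of waiting for a commit or eviction.
    Properties props = new Properties();
    props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100); // bound buffering time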

Time semantics between KStream and KTable

笑着哭i · Submitted on 2019-12-21 06:06:25
Question: I am trying to build the following topology: using Debezium connectors, I am pulling two tables (let's call them tables A and DA). As per DBZ, the topics where the table rows are stored have the structure { before: "...", after: "..." }. The first steps in my topology create "clean" KStreams off these two "table" topics. The sub-topology looks roughly like this:

    private static KStream<String, TABLE_A.Value> getTableARowByIdStream(
        StreamsBuilder builder, Properties streamsConfig) {…
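
The method body is cut off above; a hedged sketch of what such a "clean" stream typically looks like, assuming a Debezium envelope type with a getAfter() accessor and an envelopeSerde (neither is shown in the excerpt):

    // A sketch: unwrap the Debezium envelope, keep only the "after" row image,
    // and drop deletes/tombstones where "after" is null.
    KStream<String, TABLE_A.Value> clean = builder
        .stream("dbserver.schema.table_a", Consumed.with(Serdes.String(), envelopeSerde))
        .mapValues(envelope -> envelope == null ? null : envelope.getAfter())
        .filter((id, after) -> after != null);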

How to register a stateless processor (that seems to require a StateStore as well)?

血红的双手。 · Submitted on 2019-12-21 05:06:19
Question: I'm building a topology and want to use KStream.process() to write some intermediate values to a database. This step doesn't change the nature of the data and is completely stateless. Adding a Processor requires creating a ProcessorSupplier and passing that instance to KStream.process() along with the name of a state store. This is what I don't understand: how do I add a StateStore object to the topology, since it requires a StateStoreSupplier? Failing to add said StateStore gives…
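
The error message is cut off, but note that the stateStoreNames parameter of KStream.process() is varargs, so a stateless processor can be registered with no store names at all; a minimal sketch (using the pre-3.0 Processor API of the question's era):

    // A sketch: state-store names are varargs, so pass none for a stateless step.
    stream.process(() -> new Processor<String, String>() {
        @Override public void init(ProcessorContext context) { }
        @Override public void process(String key, String value) {
            // e.g. write (key, value) to the external database here
        }
        @Override public void close() { }
    });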

Kafka Streams - Send on different topics depending on Streams Data

守給你的承諾、 · Submitted on 2019-12-20 10:11:28
Question: I have a Kafka Streams application waiting for records to be published on the topic user_activity. It receives JSON data, and depending on the value against a key I want to push that stream to different topics. This is my streams app code:

    KStream<String, String> source_user_activity = builder.stream("user_activity");
    source_user_activity.flatMapValues(new ValueMapper<String, Iterable<String>>() {
        @Override
        public Iterable<String> apply(String value) {
            System.out.println("value: " +…
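
The code is truncated above; a hedged sketch of content-based routing (not necessarily the answer the poster accepted): since Kafka 2.0, to() accepts a TopicNameExtractor, so the destination topic can be derived per record. The "event_type" field below is a hypothetical example:

    // A sketch: route each record to a topic derived from its JSON content.
    ObjectMapper mapper = new ObjectMapper();
    source_user_activity.to((key, value, recordContext) -> {
        try {
            return "user_activity_" + mapper.readTree(value).get("event_type").asText();
        } catch (Exception e) {
            return "user_activity_unparsed"; // fallback topic for bad JSON
        }
    });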

Kafka Streams - How to better control partitioning of internally created state store topic?

六眼飞鱼酱① · Submitted on 2019-12-20 05:45:32
Question: State stores in Kafka Streams are created internally. State stores are partitioned by key, but (to my knowledge) do not allow partitioning by anything other than the key. Questions:
- How do you control the number of partitions of a state store's internally created topic?
- How does the state store topic infer the number of partitions and the partitioning to use by default, and how do you override that?
- How do you work around this if you want to partition your state store by something other than the key of your…
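
The question is cut off, but on the first two points: internal topics inherit the partition count of their input topics, so a hedged way to control both the key and the partition count (my sketch, assuming an Event type with a getRegion() accessor and an eventSerde) is to re-key and explicitly repartition before the stateful step; repartition() exists since Kafka Streams 2.4:

    // A sketch: choose the partitioning key and partition count explicitly;
    // the downstream state store and its changelog inherit this layout.
    KTable<String, Long> counts = stream
        .selectKey((key, event) -> event.getRegion())
        .repartition(Repartitioned.<String, Event>with(Serdes.String(), eventSerde)
            .withNumberOfPartitions(12))
        .groupByKey()
        .count();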

Streaming messages from one Kafka Cluster to another

我怕爱的太早我们不能终老 · Submitted on 2019-12-19 09:01:25
Question: I'm currently trying to stream messages easily from a topic on one Kafka cluster to another (remote -> local cluster). The idea is to use Kafka Streams right away, so that we don't need to replicate the actual messages on the local cluster but only get the "results" of the Kafka Streams processing into our Kafka topics. So let's say the WordCount demo runs on a Kafka instance on another PC than my own, and I also have a Kafka instance running on my local machine. Now I want to let the…
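
The excerpt ends here; worth noting that a single Kafka Streams application talks to exactly one cluster (one bootstrap.servers setting), so the usual options are a replication tool such as MirrorMaker or a small consumer-to-producer bridge. A minimal bridge sketch, with remoteProps/localProps and topic names assumed:

    // A sketch: forward the remote app's output topic to the local cluster.
    KafkaConsumer<String, String> remote = new KafkaConsumer<>(remoteProps);
    KafkaProducer<String, String> local = new KafkaProducer<>(localProps);
    remote.subscribe(Collections.singletonList("streams-wordcount-output"));
    while (true) {
        for (ConsumerRecord<String, String> rec : remote.poll(Duration.ofMillis(500))) {
            local.send(new ProducerRecord<>("local-wordcount", rec.key(), rec.value()));
        }
    }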

Is it a good practice to do sync database query or restful call in Kafka streams jobs?

心不动则不痛 · Submitted on 2019-12-19 05:00:46
Question: I use Kafka Streams to process real-time data. In the Kafka Streams tasks I need to query MySQL and call another RESTful service, and all of these operations are synchronous. I'm afraid the synchronous calls will reduce the processing capacity of the streams tasks. Is this good practice, or is there a better way to do it? Answer 1: A better way to do it would be to stream your MySQL table(s) into Kafka, and access the data there. This has the advantage of decoupling your streams app…
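
A sketch of what the answer describes, assuming the MySQL table is already streamed into a topic (e.g. via a CDC connector) and keyed the same way as the event stream; the lookup then becomes a local stream-table join instead of a blocking query:

    // A sketch: join events against the table's changelog topic locally
    // instead of issuing a synchronous MySQL query per record.
    KTable<String, String> users = builder.table("mysql.users"); // CDC-fed (assumed)
    KStream<String, String> enriched = builder.stream("events")
        .join(users, (event, user) -> event + " | user=" + user);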