How to manage a Kafka KStream-to-KStream windowed join?

感动是毒 2020-12-11 03:53

According to the Apache Kafka docs, KStream-to-KStream joins are always windowed joins. My question is: how can I control the size of the window? Is it the same size for

2 Answers
  • 2020-12-11 03:54

    In addition to what Matthias J. Sax said, there is a stream-to-stream (windowed) join example at: https://github.com/confluentinc/examples/blob/3.1.x/kafka-streams/src/test/java/io/confluent/examples/streams/StreamToStreamJoinIntegrationTest.java

    This is for Confluent 3.1.x with Apache Kafka 0.10.1, i.e. the latest versions as of January 2017. See the master branch in the repository above for code examples that use newer versions.

    Here's the key part of the code example above (again, for Kafka 0.10.1), slightly adapted to your question. Note that this example happens to demonstrate an OUTER JOIN.

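    // Imports needed by this snippet (Kafka 0.10.1 API).
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    long joinWindowSizeMs = TimeUnit.MINUTES.toMillis(5);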
    long windowRetentionTimeMs = TimeUnit.DAYS.toMillis(30);
    
    final Serde<String> stringSerde = Serdes.String();
    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> alerts = builder.stream(stringSerde, stringSerde, "adImpressionsTopic");
    KStream<String, String> incidents = builder.stream(stringSerde, stringSerde, "adClicksTopic");
    
    KStream<String, String> impressionsAndClicks = alerts.outerJoin(incidents,
        (impressionValue, clickValue) -> impressionValue + "/" + clickValue,
        // KStream-KStream joins are always windowed joins, hence we must provide a join window.
        JoinWindows.of(joinWindowSizeMs).until(windowRetentionTimeMs),
        stringSerde, stringSerde, stringSerde);
    
    // Write the results to the output topic.
    impressionsAndClicks.to(stringSerde, stringSerde, "outputTopic");
    
  • 2020-12-11 04:11

    That is absolutely possible. When you define your stream operator, you specify the join window size explicitly.

    KStream stream1 = ...;
    KStream stream2 = ...;
    long joinWindowSizeMs = 5L * 60L * 1000L; // 5 minutes
    long windowRetentionTimeMs = 30L * 24L * 60L * 60L * 1000L; // 30 days
    
    stream1.leftJoin(stream2,
                     ... // add ValueJoiner
                     JoinWindows.of(joinWindowSizeMs)
    );
    
    // or if you want to use retention time
    
    stream1.leftJoin(stream2,
                     ... // add ValueJoiner
                     (JoinWindows)JoinWindows.of(joinWindowSizeMs)
                                             .until(windowRetentionTimeMs)
    );
    

    See http://docs.confluent.io/current/streams/developer-guide.html#joining-streams for more details.

    The sliding window basically defines an additional join predicate. In SQL-like syntax this would be something like:

    SELECT * FROM stream1, stream2
    WHERE
       stream1.key = stream2.key
       AND
       stream1.ts - before <= stream2.ts
       AND
       stream2.ts <= stream1.ts + after
    

    where before == after == joinWindowSizeMs in this example. before and after can also have different values if you use JoinWindows#before() and JoinWindows#after() to set those values explicitly.
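
    To make the predicate concrete, here is a minimal plain-Java sketch of the same condition. It is only a model of the window check, not Kafka Streams code: the class JoinWindowSketch and the method joins are hypothetical names made up for illustration, and timestamps are plain milliseconds.

```java
// Plain-Java model of the join predicate; NOT part of the Kafka Streams API.
// JoinWindowSketch and joins(...) are hypothetical names used for illustration.
final class JoinWindowSketch {
    private JoinWindowSketch() {}

    /**
     * Returns true if two records would match under the predicate
     * key1 == key2 AND ts1 - before <= ts2 AND ts2 <= ts1 + after.
     */
    static boolean joins(String key1, long ts1Ms, String key2, long ts2Ms,
                         long beforeMs, long afterMs) {
        return key1.equals(key2)
                && ts1Ms - beforeMs <= ts2Ms
                && ts2Ms <= ts1Ms + afterMs;
    }

    public static void main(String[] args) {
        long fiveMinutesMs = 5L * 60L * 1000L;
        long ts1 = 1_000_000L;

        // Symmetric window: before == after == 5 minutes.
        System.out.println(joins("a", ts1, "a", ts1 + 240_000L, fiveMinutesMs, fiveMinutesMs)); // true
        System.out.println(joins("a", ts1, "a", ts1 + 360_000L, fiveMinutesMs, fiveMinutesMs)); // false
        // Different keys never join, regardless of timestamps.
        System.out.println(joins("a", ts1, "b", ts1, fiveMinutesMs, fiveMinutesMs)); // false
        // Asymmetric window: 1 minute before, 5 minutes after.
        System.out.println(joins("a", ts1, "a", ts1 - 120_000L, 60_000L, fiveMinutesMs)); // false
    }
}
```

    The last call shows the asymmetric case: a stream2 record two minutes older than the stream1 record falls outside a one-minute "before" bound, even though it would match with the symmetric five-minute window.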

    The retention time of the source topics is completely independent of the specified windowRetentionTimeMs, which is applied to a changelog topic created by Kafka Streams itself. Window retention allows out-of-order records to be joined with each other, i.e., records that arrive late (keep in mind that Kafka guarantees ordering by offset, but with regard to timestamps, records can be out of order).
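
    As a rough illustration of what "out of order with regard to timestamps" means, the plain-Java sketch below walks over timestamps in offset order and flags the late ones. It is a deliberately simplified model under stated assumptions: OutOfOrderSketch, outOfOrderOffsets and stillRetained are hypothetical names, and the retention check (timestamp no older than the highest timestamp seen minus the retention time) only approximates how Kafka Streams expires window state.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model; NOT Kafka Streams code. All names here are hypothetical.
final class OutOfOrderSketch {
    private OutOfOrderSketch() {}

    /**
     * Takes record timestamps in offset order (index == offset) and returns the
     * offsets whose timestamp is lower than a timestamp seen at an earlier offset.
     */
    static List<Integer> outOfOrderOffsets(long[] timestampsByOffset) {
        List<Integer> late = new ArrayList<>();
        long maxSeen = Long.MIN_VALUE;
        for (int offset = 0; offset < timestampsByOffset.length; offset++) {
            long ts = timestampsByOffset[offset];
            if (ts < maxSeen) {
                late.add(offset);   // arrived after a record with a newer timestamp
            } else {
                maxSeen = ts;       // timestamps are still advancing
            }
        }
        return late;
    }

    /**
     * Assumed approximation of window retention: a record is still available
     * for joining if it is no older than (highest timestamp seen - retention).
     */
    static boolean stillRetained(long recordTsMs, long maxSeenTsMs, long retentionMs) {
        return recordTsMs >= maxSeenTsMs - retentionMs;
    }

    public static void main(String[] args) {
        long[] timestamps = {100L, 200L, 150L, 300L, 250L};
        System.out.println(outOfOrderOffsets(timestamps));  // [2, 4]
        System.out.println(stillRetained(150L, 300L, 200L)); // true
        System.out.println(stillRetained(50L, 300L, 200L));  // false
    }
}
```

    In this model, a longer retention time simply widens the range of late timestamps that can still find a join partner.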
