Apache Flink Streaming window WordCount

问题

I have following code to count words from socketTextStream. Both cumulate word counts and time windowed word counts are needed. The program has an issue that cumulateCounts is always the same as windowed counts. Why this issue occurs? What is the correct way to calculate cumulate counts base on windowed counts?

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final HashMap<String, Integer> cumulateCounts = new HashMap<String, Integer>();

final DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999)
            .flatMap(new Splitter())
            .window(Time.of(5, TimeUnit.SECONDS))
            .groupBy(0).sum(1)
            .flatten();

counts.print();

counts.addSink(new SinkFunction<Tuple2<String, Integer>>() {
    @Override
    public void invoke(Tuple2<String, Integer> value) throws Exception {
        String word = value.f0;
        Integer delta_count = value.f1;
        Integer count = cumulateCounts.get(word);
        if (count == null)
            count = 0;
        count = count + delta_count;
        cumulateCounts.put(word, count);
        System.out.println("(" + word + "," + count.toString() + ")");
    }
});

回答1:

You should first group-by, and apply the window on the keyed data stream (your code works on Flink 0.9.1 but the new API in Flink 0.10.0 is strict about this):

final DataStream<Tuple2<String, Integer>> counts = env
        .socketTextStream("localhost", 9999)
        .flatMap(new Splitter())
        .groupBy(0)
        .window(Time.of(5, TimeUnit.SECONDS)).sum(1)
        .flatten();

If you apply a window on a non-keyed data stream, there will be only a single threaded window operator on a single machine (ie, no parallelism) to build the window on the whole stream (in Flink 0.9.1, this global window can be split into sub-windows by groupBy() -- however, in Flink 0.10.0 this will not work any more). To counts words, you want to build a window for each distinct key value, ie, you first get a sub-stream per key value (via groupBy()) and apply a window operator on each sub stream (thus, you could have an own window operator instance for each sub-stream, allowing for parallel execution).

For a global (cumulated) count, you can simple apply a groupBy().sum() construct. First, the stream is split into sub-stream (one for each key value). Second, you compute the sum over the stream. Because the stream is not windowed, the sum in computed (cumulative) and updated for each incoming tuple (in more details, the sum has an initial result value of zero and the result is updated for each tuple as result += tuple.value). After each invocation of sum, the new current result is emitted.

In your code, you should not use your special sink function but do as follows:

counts.groupBy(0).sum(1).print();

来源：https://stackoverflow.com/questions/33446247/apache-flink-streaming-window-wordcount

标签

java

apache-flink

flink-streaming