Question
The DataStream is partitioned and distributed across slots for processing, and I can get the result of each partitioned task. What is the best way to apply a function to the results from the different partitions and produce a global summary?
Updated: I want to implement a data summary algorithm such as Misra-Gries in Flink. It maintains k counters and updates them as data arrives. Because the data may be very large, it would be better for each partition to keep its own k counters and process in parallel, and then merge those counters into a final set of k counters to present the result. What is the best way to do the merge?
Answer 1:
Flink's built-in aggregation functions, like reduce, sum, and max, are built on top of Flink's managed keyed state mechanism, and can only be applied to a KeyedStream. What you can do, however, is use either windowAll or a ProcessFunction. Here is an example:
    parallelStream
        .process(new MyProcessFunction())
        .setParallelism(1)
        .print()
        .setParallelism(1);
Note that all of the preliminary processing is being done at the default parallelism, and then the process function and print are being applied serially.
The ProcessFunction should keep its state in managed operator (non-keyed) state in order to be fault tolerant.
This will produce a continuously updated stream of summaries over the entire input. Use something like countWindowAll or timeWindowAll if you prefer to produce summaries over windows.
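For the Misra-Gries case in the question, the merge step that the parallelism-1 function would apply could look roughly like the sketch below. It is plain Java with no Flink dependencies. The class name MisraGriesSummary and its methods are hypothetical, items are assumed to be strings, and the merge rule is the usual "add the counters, then subtract the (k+1)-th largest count" approach for mergeable summaries, not something taken from Flink itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: a Misra-Gries summary holding at most k counters.
public class MisraGriesSummary {
    private final int k;
    private final Map<String, Long> counters = new HashMap<>();

    public MisraGriesSummary(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    // Update step that each parallel partition would run per element.
    public void add(String item) {
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < k) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            counters.replaceAll((key, count) -> count - 1);
            counters.values().removeIf(count -> count == 0);
        }
    }

    // Merge step: combine two partial summaries into a new one with at most k counters.
    public MisraGriesSummary merge(MisraGriesSummary other) {
        MisraGriesSummary merged = new MisraGriesSummary(k);
        merged.counters.putAll(counters);
        other.counters.forEach((key, count) -> merged.counters.merge(key, count, Long::sum));
        if (merged.counters.size() > k) {
            List<Long> sorted = new ArrayList<>(merged.counters.values());
            sorted.sort(Collections.reverseOrder());
            long threshold = sorted.get(k); // the (k+1)-th largest count
            merged.counters.replaceAll((key, count) -> count - threshold);
            merged.counters.values().removeIf(count -> count <= 0);
        }
        return merged;
    }

    public Map<String, Long> counters() {
        return Collections.unmodifiableMap(counters);
    }
}
```

In a job shaped like the example above, each parallel subtask would presumably emit its partial summary downstream, and the function running at parallelism 1 would fold incoming summaries together with merge and emit the updated global result.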
Source: https://stackoverflow.com/questions/47842618/flink-what-is-the-best-way-to-summarize-the-result-from-all-partitions