Flink: What is the best way to summarize the result from all partitions

Submitted by 青春壹個敷衍的年華 on 2019-12-11 13:16:45

Question


The data stream is partitioned and distributed to the slots for processing, so I can get the result of each partition's task. What is the best approach for applying a function to the results of the different partitions to get a global summary?

Updated: I want to implement a data summary algorithm such as Misra-Gries in Flink. It maintains k counters and updates them as data arrives. Because the data may be large, it would be better for each partition to keep its own k counters and process in parallel, then merge those counters into a final set of k counters to present the result. What is the best way to do the combination?


Answer 1:


Flink's built-in aggregation functions, such as reduce, sum, and max, are built on top of Flink's managed keyed state mechanism and can only be applied to a KeyedStream. What you can do, however, is use either WindowAll or a ProcessFunction running at a parallelism of 1. Here is an example:

parallelStream
  // a single instance of the function receives the records from every partition
  .process(new MyProcessFunction())
  .setParallelism(1)
  .print()
  .setParallelism(1);

Note that all of the preliminary processing is being done at the default parallelism, and then the process function and print are being applied serially.

The ProcessFunction should keep its state in managed operator (non-keyed) state in order to be fault tolerant.

This will produce a continuously updated stream of summaries over the entire input. Use something like countWindowAll or timeWindowAll if you prefer to produce summaries over windows.
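For the Misra-Gries case in the question, the per-partition counters and the final merge could look roughly like the sketch below. It is plain Java with no Flink API in it; the class name `MisraGries` and its methods are hypothetical, not part of any library. The assumption is that each parallel subtask keeps one such summary and emits it downstream, and the function running at parallelism 1 merges whatever it receives. The merge shown is one commonly described rule for this kind of summary (add the counters, subtract the (k+1)-th largest count, drop anything that is not positive), not necessarily the only option.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper: a Misra-Gries summary holding at most k counters.
final class MisraGries<T> {
    private final int k;
    private final Map<T, Long> counters = new HashMap<>();

    MisraGries(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        this.k = k;
    }

    // Update the summary with one arriving item.
    void add(T item) {
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < k) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<T, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<T, Long> e = it.next();
                if (e.getValue() == 1L) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    // Combine two partition summaries into a new one with at most k counters.
    // Uses this summary's k; both inputs are assumed to share the same k.
    MisraGries<T> merge(MisraGries<T> other) {
        MisraGries<T> out = new MisraGries<>(k);
        out.counters.putAll(this.counters);
        other.counters.forEach((key, v) -> out.counters.merge(key, v, Long::sum));
        if (out.counters.size() > k) {
            long[] sorted = out.counters.values().stream()
                    .mapToLong(Long::longValue).sorted().toArray();
            long cut = sorted[sorted.length - k - 1]; // the (k+1)-th largest count
            out.counters.replaceAll((key, v) -> v - cut);
            out.counters.values().removeIf(v -> v <= 0);
        }
        return out;
    }

    // Defensive copy of the current counters.
    Map<T, Long> counters() {
        return new HashMap<>(counters);
    }

    public static void main(String[] args) {
        MisraGries<String> left = new MisraGries<>(2);
        for (String s : new String[] {"x", "x", "y", "x", "z"}) {
            left.add(s);
        }
        MisraGries<String> right = new MisraGries<>(2);
        for (String s : new String[] {"x", "x", "w", "w", "w", "v"}) {
            right.add(s);
        }
        System.out.println(left.merge(right).counters());
    }
}
```

The counts in the merged summary are lower bounds on the true frequencies, as with a single Misra-Gries summary. Inside Flink, the map of counters is what would have to go into operator state for the merging function to be fault tolerant.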



Source: https://stackoverflow.com/questions/47842618/flink-what-is-the-best-way-to-summarize-the-result-from-all-partitions
