Is a reused stream a copy of the stream or not?

Submitted by 江枫思渺然 on 2019-12-24 02:52:42

Question


For example, there is a keyed stream:

val keyedStream: KeyedStream[event, Key] = env
    .addSource(...)
    .keyBy(...)

// several transformations on the same stream
keyedStream.map(....)
keyedStream.window(....)
keyedStream.split(....)
keyedStream...(....)

I think this is reuse of the same stream in Flink. What I found is that when I reuse it, the content of the stream is not affected by the other transformations, so I think each transformation works on a copy of the same stream.

  • But I don't know whether this is right.

  • If it is, will keeping the copies use a lot of resources (and which resources)?


Answer 1:


A DataStream (or KeyedStream) on which multiple operators are applied replicates all outgoing messages. For instance, if you have a program such as:

val keyedStream: KeyedStream[event, Key] = env
  .addSource(...)
  .keyBy(...)

val stream1: DataStream = keyedStream.map(new MapFunc1)
val stream2: DataStream = keyedStream.map(new MapFunc2)

The program is executed as

           /-hash-> Map(MapFunc1) -> ...
 Source >-<
           \-hash-> Map(MapFunc2) -> ...

The source replicates each record and sends it to both downstream operators (MapFunc1 and MapFunc2). The type of the operators (in our example Map) does not matter.
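
The forking behaviour described above can be tried out with a small job. The following is only a minimal sketch, assuming the Flink Scala DataStream API is on the classpath; the `Event` case class, the `doubled` and `negated` helpers, and the in-memory `fromElements` source are hypothetical stand-ins for the `event` type, `MapFunc1`, `MapFunc2`, and `addSource(...)` of the example.

```scala
import org.apache.flink.streaming.api.scala._

object ForkedStreamSketch {
  // Hypothetical record type standing in for the question's `event`.
  case class Event(key: String, value: Int)

  // Two independent, hypothetical transformations (stand-ins for MapFunc1/MapFunc2).
  def doubled(e: Event): Event = e.copy(value = e.value * 2)
  def negated(e: Event): Event = e.copy(value = -e.value)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A bounded in-memory source replaces addSource(...) for illustration.
    val keyedStream: KeyedStream[Event, String] = env
      .fromElements(Event("a", 1), Event("b", 2), Event("a", 3))
      .keyBy(_.key)

    // Both operators read from the same keyedStream. Each one receives
    // every record, and neither sees the other's output.
    val stream1: DataStream[String] = keyedStream.map(e => s"doubled: ${doubled(e)}")
    val stream2: DataStream[String] = keyedStream.map(e => s"negated: ${negated(e)}")

    stream1.print()
    stream2.print()

    env.execute("forked-stream-sketch")
  }
}
```

Each input record should show up once with the `doubled:` prefix and once with the `negated:` prefix. The order of the printed lines is not deterministic.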

The cost of this is sending each record twice over the network. If all receiving operators have the same parallelism it could be optimized by sending each record once and duplicating it at the receiving task manager, but this is currently not done.

You can manually optimize the program by adding a single receiving operator (e.g., an identity Map operator) followed by another keyBy, from which you fork to the multiple receivers. This will not result in a network shuffle, because all records are already local. All operators must have the same parallelism, though.
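
A rough sketch of that manual optimization follows, again assuming the Flink Scala DataStream API. The `Event` type, the `addOne` and `square` helpers, and the `fromElements` source are hypothetical and only illustrate the shape of the job. Whether the second keyBy really avoids network traffic depends on the claim above, which this sketch does not verify.

```scala
import org.apache.flink.streaming.api.scala._

object LocalForkSketch {
  // Hypothetical record type standing in for the question's `event`.
  case class Event(key: String, value: Int)

  // Hypothetical downstream transformations (stand-ins for MapFunc1/MapFunc2).
  def addOne(e: Event): Event = e.copy(value = e.value + 1)
  def square(e: Event): Event = e.copy(value = e.value * e.value)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A bounded in-memory source replaces addSource(...) for illustration.
    val keyedStream: KeyedStream[Event, String] = env
      .fromElements(Event("a", 1), Event("b", 2), Event("a", 3))
      .keyBy(_.key)

    // Single receiving operator: the identity map is the only consumer of
    // keyedStream, so the source sends each record downstream once.
    val received: DataStream[Event] = keyedStream.map(e => e)

    // Key again on the same field and fork from here. No per-operator
    // parallelism is set, so the identity map and both receivers share the
    // environment's default parallelism, as the answer requires.
    val rekeyed: KeyedStream[Event, String] = received.keyBy(_.key)

    val stream1: DataStream[Event] = rekeyed.map(e => addOne(e))
    val stream2: DataStream[Event] = rekeyed.map(e => square(e))

    stream1.print()
    stream2.print()

    env.execute("local-fork-sketch")
  }
}
```

The design trade-off is one extra operator in the job graph in exchange for sending each record from the source only once.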



Source: https://stackoverflow.com/questions/47750597/reuse-of-a-stream-is-a-copy-of-stream-or-not
