Enriching DataStream using static DataSet in Flink streaming

ⅰ亾dé卋堺 提交于 2019-12-12 19:06:08

问题


I am writing a Flink streaming program in which I need to enrich a DataStream of user events using some static data set (information base, IB).

For E.g. Let's say we have a static data set of buyers and we have an incoming clickstream of events, for each event we want to add a boolean flag indicating whether the doer of the event is a buyer or not.

An ideal way to achieve this would be to partition the incoming stream by user id, have the buyers set available in a DataSet partitioned again by user id and then do a look up for each event in the stream into this DataSet.

Since Flink does not allow using DataSets in a streaming program, how can I achieve the above ?

Another option could be to use Managed Operator State to store buyers set, but how can I keep this state distributed by user id so as to avoid network i/o in individual event look ups ? In case of memory state backend, does state remain distributed by some key, or is it replicated across all operator subtasks ?

What is the right design pattern to achieve the above enriching requirement in a Flink streaming program ?


回答1:


I would key the stream by user_id, and use a RichFlatMap to do the enrichment. In the open() method of the RichFlatMap you can load the static buyer flag for that user and keep it cached in a boolean field.



来源:https://stackoverflow.com/questions/49645285/enriching-datastream-using-static-dataset-in-flink-streaming

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!