问题
I am writing a Flink streaming program in which I need to enrich a DataStream of user events using some static data set (information base, IB).
For E.g. Let's say we have a static data set of buyers and we have an incoming clickstream of events, for each event we want to add a boolean flag indicating whether the doer of the event is a buyer or not.
An ideal way to achieve this would be to partition the incoming stream by user id, have the buyers set available in a DataSet partitioned again by user id and then do a look up for each event in the stream into this DataSet.
Since Flink does not allow using DataSets in a streaming program, how can I achieve the above ?
Another option could be to use Managed Operator State to store buyers set, but how can I keep this state distributed by user id so as to avoid network i/o in individual event look ups ? In case of memory state backend, does state remain distributed by some key, or is it replicated across all operator subtasks ?
What is the right design pattern to achieve the above enriching requirement in a Flink streaming program ?
回答1:
I would key the stream by user_id, and use a RichFlatMap to do the enrichment. In the open() method of the RichFlatMap you can load the static buyer flag for that user and keep it cached in a boolean field.
来源:https://stackoverflow.com/questions/49645285/enriching-datastream-using-static-dataset-in-flink-streaming