Flink: how to store state and use in another stream?

Submitted by 不羁岁月 on 2019-12-10 07:06:14

Question


I have a use-case for Flink where I need to read information from a file, store each line, and then use this state to filter another stream.

I have all of this working right now with the connect operator and a RichCoFlatMapFunction, but it feels overly complicated. Also, I'm concerned that flatMap2 could begin executing before all of the state is loaded from the file:

fileStream
    .connect(partRecordStream.keyBy((KeySelector<PartRecord, String>) partRecord -> partRecord.getPartId()))
    .keyBy((KeySelector<String, String>) partId -> partId, (KeySelector<PartRecord, String>) partRecord -> partRecord.getPartId())
    .flatMap(new RichCoFlatMapFunction<String, PartRecord, PartRecord>() {
        private transient ValueState<String> storedPartId;
        @Override
        public void flatMap1(String partId, Collector<PartRecord> out) throws Exception {
            // store state
            storedPartId.update(partId);
        }

        @Override
        public void flatMap2(PartRecord record, Collector<PartRecord> out) throws Exception {
            if (record.getPartId().equals(storedPartId.value())) {
                out.collect(record);
            } else {
                // do nothing
            }
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            ValueStateDescriptor<String> descriptor =
                    new ValueStateDescriptor<>(
                            "partId", // the state name
                            TypeInformation.of(new TypeHint<String>() {}),
                            null);
            storedPartId = getRuntimeContext().getState(descriptor);
        }
    });

Is there a better way (as of Flink 1.1.3) to accomplish this pattern of loading state, then using it in subsequent streams?


Answer 1:


Your concerns about the CoFlatMapFunction are correct. The order in which flatMap1 and flatMap2 are called cannot be controlled; it depends on the order in which data arrives at the operator. So flatMap2 might be called before flatMap1 has read all of the file data.

The only way in Flink 1.1.3 to read all data before starting to process a stream is to consume the data in the open() method of a RichFlatMapFunction, i.e., you have to manually read and parse the file.

This is basically a broadcast join strategy, i.e., each parallel instance of the operator reads the whole file itself. The drawback is that the file's data is replicated in every parallel instance. The benefit is that you do not have to shuffle the "main" stream (no need to use keyBy()).
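To make the open() approach a bit more concrete, here is a minimal sketch of the load-then-filter logic. It is written in plain Java without the Flink dependency so it can be run standalone. The names PartIdFilter and load are hypothetical, PartRecord is only a stand-in for the type in the question, and the file is assumed to hold one part ID per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for the body of a RichFlatMapFunction<PartRecord, PartRecord>.
// All names here are hypothetical; no Flink classes are used.
final class PartIdFilter {

    // Minimal stand-in for the PartRecord type from the question.
    static final class PartRecord {
        private final String partId;
        PartRecord(String partId) { this.partId = partId; }
        String getPartId() { return partId; }
    }

    // Local copy of the file contents; each parallel instance would hold its own.
    private final Set<String> partIds = new HashSet<>();

    // Would be called from open(): reads the whole file before any record arrives.
    void open(Path file) throws IOException {
        load(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Parsing step, separated from file I/O; assumes one part ID per line.
    void load(Iterable<String> lines) {
        for (String line : lines) {
            String id = line.trim();
            if (!id.isEmpty()) {
                partIds.add(id);
            }
        }
    }

    // Would be called from flatMap(): forwards a record only if its ID was loaded.
    void flatMap(PartRecord record, List<PartRecord> out) {
        if (partIds.contains(record.getPartId())) {
            out.add(record);
        }
    }
}
```

In an actual job, the body of open(Path) would presumably go into open(Configuration) of the RichFlatMapFunction, with the file path passed in through the function's constructor, and flatMap would write to the Collector<PartRecord> instead of a List. Because the set is an ordinary field and not keyed state, partRecordStream can be used directly without keyBy().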



Source: https://stackoverflow.com/questions/41064059/flink-how-to-store-state-and-use-in-another-stream
