Apache Beam: Refreshing a side input that I am reading from MongoDB using MongoDbIO.read()

白昼怎懂夜的黑, posted on 2020-06-29 04:20:09

Question


I am reading a PCollection, mongodata, from MongoDB and passing it as a side input to my ParDo via ParDo.of(DoFn).withSideInputs(PCollectionView).

On the backend, my MongoDB collection is updated daily, monthly, or perhaps yearly, and I need the newly added values in my pipeline.

In other words, I want to refresh the Mongo collection values in a running pipeline. For example, if the collection has 20,000 documents and three more are added a day later, I need those three new values in my pipeline as well, for 20,003 in total.

Currently my pipeline looks like this.

PCollection<String> mongodata =  pipeline.apply(MongoDbIO.read()
                .withUri(options.getMongoDBHostName())
                .withDatabase(options.getMongoDBDatabaseName())
                .withCollection(options.getMongoVinCollectionName()))
                .apply(ParDo.of(new ConvertDocuemntToStringFn()));

PCollectionView<List<String>> list_of_data = mongodata.apply(View.<String> asList());

PCollection<PubsubMessage>  pubsubMessagePCollection = controller.flattenPubSubPCollection(
                controller.fetchDataFromBucket(options),pipeline);

pubsubMessagePCollection
        .apply("Convert pubsub to kv,k=vin", ParDo.of(new ConvertPubsubToKVFn()))
        .apply("group by vin key", GroupByKey.<String, String>create())
        .apply("converting message to document type",
                ParDo.of(new ConvertMessageToDocumentTypeFn(list_of_data))
                        .withSideInputs(list_of_data))
        .apply(MongoDbIO.write()
                .withUri(options.getMongoDBHostName())
                .withDatabase(options.getMongoDBDatabaseName())
                .withCollection(CollectionA));
pipeline.run();

I want mongodata (list_of_data) to be refreshed whenever the backend updates the collection, without stopping the pipeline.

I looked into GenerateSequence and triggering but could not find exact code to test this. Please help, and if possible provide updated code that solves this.

Please let me know if you need more info.

thanks


Answer 1:


You'll want to use GenerateSequence to periodically create elements, have a ParDo that reads from MongoDB on each element, then window into GlobalWindows with a repeating trigger and turn the result into a view. I don't think you'll be able to use MongoDbIO directly, since it doesn't support running in the middle of a pipeline like this. The code will be something like:

PCollectionView<List<String>> list_of_data = pipeline
  .apply(GenerateSequence.from(0).withRate(1, Duration.standardHours(24))) // adjust polling rate
  .apply(ParDo.of(new DoFn<Long, List<String>>() {
    @ProcessElement
    public void process(@Element Long unused, OutputReceiver<List<String>> out) {
      // Read entire DB, and output it as a List<String> via out.output(...)
    }
  }))
  .apply(Window.<List<String>>into(new GlobalWindows())
    // fire on every new list, and keep only the newest one
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
    .discardingFiredPanes())
  .apply(View.asSingleton()); // each firing replaces the previous side input value
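
To make the intended behaviour concrete, here is a minimal plain-Java model of it (no Beam or MongoDB involved). This is only a sketch: RefreshingSnapshot, its loader, and the demo class are hypothetical names made up for illustration. refresh() plays the role of one GenerateSequence tick followed by the "read entire DB" ParDo, and current() plays the role of reading the side input in the main DoFn. It uses the numbers from the question: 20,000 documents, then three more.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Plain-Java model of a periodically refreshed side input (hypothetical helper, not a Beam class). */
class RefreshingSnapshot<T> {
  // Stands in for "read the entire MongoDB collection".
  private final Supplier<List<T>> loader;
  // Holds the most recent full snapshot; starts out empty.
  private final AtomicReference<List<T>> latest =
      new AtomicReference<>(Collections.<T>emptyList());

  RefreshingSnapshot(Supplier<List<T>> loader) {
    this.loader = loader;
  }

  /** One tick: re-read everything and replace the previous snapshot with an immutable copy. */
  void refresh() {
    latest.set(Collections.unmodifiableList(new ArrayList<>(loader.get())));
  }

  /** What the main-path DoFn would see when it reads the side input. */
  List<T> current() {
    return latest.get();
  }
}

class RefreshingSnapshotDemo {
  public static void main(String[] args) {
    // Fake "collection" with 20,000 documents.
    List<String> backend = new ArrayList<>();
    for (int i = 0; i < 20_000; i++) {
      backend.add("doc-" + i);
    }

    RefreshingSnapshot<String> sideInput = new RefreshingSnapshot<>(() -> backend);
    sideInput.refresh();
    System.out.println(sideInput.current().size()); // 20000

    // The backend adds three documents; the snapshot does not change by itself.
    backend.add("doc-a");
    backend.add("doc-b");
    backend.add("doc-c");
    System.out.println(sideInput.current().size()); // 20000

    // After the next tick the whole collection is re-read.
    sideInput.refresh();
    System.out.println(sideInput.current().size()); // 20003
  }
}
```

Note that the new documents only become visible after the next refresh, so in the real pipeline the polling rate bounds how stale the side input can get. Each refresh also re-reads the whole collection, which is fine for around 20K documents but worth keeping in mind if the collection grows.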


Source: https://stackoverflow.com/questions/62285067/apache-beam-refreshing-a-sideinput-which-i-am-reading-from-the-mongodb-using-m
