Apache Beam: Refreshing a side input that I am reading from MongoDB using MongoDbIO.read() (Part 2)

回眸只為那壹抹淺笑 submitted on 2020-06-17 15:57:29

Question


I am not sure how GenerateSequence would work for me, since I have to read values from MongoDB periodically, on an hourly or daily basis. I created a ParDo that reads from MongoDB and added a Window into GlobalWindows with a trigger (I will update the trigger as per requirements). However, the code snippet below gives a return type error. Could you please help me correct these lines of code? Please also see the snapshot of the error. Also, how does GenerateSequence help in my case?

PCollectionView<List<String>> list_of_vins = pipeline
                  .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5))) // adjust polling rate
                  .apply(ParDo.of(new DoFn<Long, List<String>>() {
                    @ProcessElement
                    public void process(ProcessContext c) {
                      // Read entire DB, and output as a List<String>
                        final String uriString = "mongodb://$[username]:$[password]@$[hostlist]/$[database]?authSource=$[authSource]";
                        MongoClient mongoClient = MongoClients.create(uriString);
                        MongoDatabase mongoDB = mongoClient.getDatabase(options.getMongoDBHostName());
                        MongoCollection<Document> mongoCollection = mongoDB.getCollection(options.getMongoVinCollectionName());
                        c.output((List<String>) ((View) mongoCollection).asList());
                    }
                  })
                  .apply(Window.into(new GlobalWindows()).triggering(AfterPane.elementCountAtLeast(1))));

Answer 1:


You'll need to specify the types on the Window transform like this:

.apply(Window.<List<String>>into(...));
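The explicit `<List<String>>` is an ordinary Java type witness. When a generic static method starts a call chain, the compiler cannot use the eventual target type to infer its type parameter and falls back to Object, which is one likely source of the return type error in the question. Below is a minimal, stdlib-only sketch of the same effect; the `Box` class and its methods are hypothetical stand-ins that only mimic the shape of `Window.into(...).triggering(...)`, not Beam classes.

```java
import java.util.Arrays;
import java.util.List;

public class TypeWitnessDemo {
    // Hypothetical stand-in for a generic transform such as Window<T>.
    public static final class Box<T> {
        private final String label;

        private Box(String label) { this.label = label; }

        // Generic static factory, analogous in shape to Window.into(...).
        public static <T> Box<T> into(String label) { return new Box<>(label); }

        // Chained configuration call, analogous in shape to .triggering(...).
        public Box<T> configured(String suffix) { return new Box<>(label + suffix); }

        public T passThrough(T value) { return value; }
    }

    public static List<String> demo() {
        // Without the witness, Box.into("w") is inferred as Box<Object>, so the
        // chained result is a Box<Object> and this line would not compile:
        // Box<List<String>> bad = Box.into("w").configured("!");

        // With the witness, T is fixed before the chained call is resolved.
        Box<List<String>> ok = Box.<List<String>>into("w").configured("!");
        return ok.passThrough(Arrays.asList("vin1", "vin2"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same reasoning applies whenever a chained call follows a generic factory method: supply the type argument on the first call in the chain.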



Answer 2:


@danielm and all,

I have updated my code and it seems to be working, but I have a few questions and need some clarification before going ahead:

PCollection<String> list_of_vins_1 = pipeline
            // Generate a tick every 2 minutes
            .apply("Ticker", GenerateSequence.from(0).withRate(1, Duration.standardMinutes(2)))
            // On each tick, read the documents from MongoDB
            .apply("Read Data from Mongo DB",ParDo.of(new DoFn<Long, Document>() {
                    @ProcessElement
                    public void processElement(@Element Long tick, OutputReceiver<Document> out) {
                            // reading values from Mongo DB
                            out.output(mongoDocuments);
                        }
                    }
                }
            )).apply("Window", Window.<Document>into(new GlobalWindows()).triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1))).discardingFiredPanes())
            .apply(ParDo.of(new ConvertDocuemntToStringFn()));

// Convert the MongoDB data to a List<String> side input view
PCollectionView<List<String>> list_of_data_1 = list_of_vins_1.apply(View.<String> asList());

I am able to read values from MongoDB at the ticker interval specified, but I am not sure whether this will increase the size of my side input. Since I am passing list_of_data_1 as a side input, the pipeline shows that the count of elements added keeps increasing.

Suppose MongoDB has 20,000 records and the ticker runs every 2 minutes. Then the number of elements added would be 20,000 multiplied by the number of times the ticker has run, i.e. 20,000 + 20,000 + 20,000 + ... and so on.
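To put numbers on that append scenario, here is a small arithmetic sketch. It models only the worst case described above, where every tick adds a full copy of the data. It says nothing about which behavior Beam actually exhibits, and the class and method names are hypothetical.

```java
public class SideInputGrowth {
    // Elements accumulated after a given run time, IF every tick appended a full copy.
    // Assumption: one full read per completed tick interval.
    public static long appendedElements(long recordsPerRead, long tickMinutes, long elapsedMinutes) {
        long ticks = elapsedMinutes / tickMinutes; // completed tick intervals
        return Math.multiplyExact(recordsPerRead, ticks);
    }

    public static void main(String[] args) {
        System.out.println(appendedElements(20_000, 2, 60));      // one hour: 600000
        System.out.println(appendedElements(20_000, 2, 24 * 60)); // one day: 14400000
    }
}
```

If the side input were instead replaced on each tick, the size would stay at whatever a single read returns.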

So my question is: are elements added to the side input on every tick, or is the side input refreshed so that it always holds 20,000 values (or whatever MongoDB currently contains)? In other words, is it appending or overwriting?



Source: https://stackoverflow.com/questions/62302183/apache-beam-refreshing-a-sideinput-which-i-am-reading-from-the-mongodb-using-m
