dataflow

TPL Dataflow: implementing a pipeline with a precondition

[亡魂溺海] Submitted on 2020-07-16 06:17:25
Question: I have a question about implementing a pipeline using the TPL Dataflow library. My case is that I have software that needs to process some tasks concurrently. Processing looks like this: first we process an album at the global level, and then we go inside the album and process each picture individually. Let's say the application has processing slots and they are configurable (for the sake of example, assume slots = 2). This means the application can process either: a) two albums at the same time, or b) one…
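A minimal TPL Dataflow sketch of such a two-stage pipeline, assuming a hypothetical Album type and stand-in processing delays; the configurable slot count is approximated here with MaxDegreeOfParallelism on each block:

    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class AlbumPipeline
    {
        // Hypothetical domain type for illustration.
        record Album(string Name, string[] Pictures);

        static async Task Main()
        {
            int slots = 2; // configurable number of processing slots
            var opts = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = slots };

            // Stage 1: album-level (global) processing.
            var albumBlock = new TransformBlock<Album, Album>(async album =>
            {
                await Task.Delay(100); // stand-in for global album processing
                return album;
            }, opts);

            // Stage 2: per-picture processing inside the album.
            var pictureBlock = new ActionBlock<Album>(async album =>
            {
                foreach (var picture in album.Pictures)
                    await Task.Delay(50); // stand-in for per-picture processing
            }, opts);

            albumBlock.LinkTo(pictureBlock, new DataflowLinkOptions { PropagateCompletion = true });

            albumBlock.Post(new Album("Holiday", new[] { "a.jpg", "b.jpg" }));
            albumBlock.Complete();
            await pictureBlock.Completion;
        }
    }

Note that MaxDegreeOfParallelism limits each stage independently; a single slot pool shared across both stages would need something like a SemaphoreSlim held for the album's whole lifetime.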

Apache Beam: refreshing a side input read from MongoDB using MongoDbIO.read()

白昼怎懂夜的黑 Submitted on 2020-06-29 04:20:09
Question: I am reading a PCollection mongodata from MongoDB and using this PCollection as a side input to my ParDo(DoFN).withSideInputs(PCollection). On the backend, my MongoDB collection is updated on a daily, monthly, or perhaps yearly basis, and I need those newly added values in my pipeline. You can think of this as refreshing the Mongo collection's values in a running pipeline. For example, if the Mongo collection has 20K documents in total and after one day three more records are added to the collection…
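A sketch of the "slowly updating side input" pattern from the Beam documentation, which fits this case: GenerateSequence emits a tick on a fixed schedule, a DoFn re-reads the Mongo collection on each tick (readAllFromMongo() is a hypothetical stub here), and a global window with a repeated trigger keeps the view pointing at the latest pane:

    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.joda.time.Duration;

    public class RefreshingSideInput {
      public static PCollectionView<List<String>> build(Pipeline p) {
        return p
            // One tick per day; each tick causes a fresh Mongo read.
            .apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1)))
            .apply(ParDo.of(new DoFn<Long, List<String>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                c.output(readAllFromMongo()); // hypothetical helper re-reading the collection
              }
            }))
            // Re-fire in the global window so the view always holds the latest read.
            .apply(Window.<List<String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
            .apply(View.asSingleton());
      }

      static List<String> readAllFromMongo() {
        throw new UnsupportedOperationException("stub for illustration");
      }
    }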

How to call TriggerBatch automagically after a timeout if the number of queued items is less than the BatchSize?

旧巷老猫 Submitted on 2020-06-24 04:13:33
Question: Using the Dataflow CTP (in the TPL): is there a way to call BatchBlock.TriggerBatch automatically if the number of currently queued or postponed items is less than the BatchSize, after a timeout? Better yet: this timeout should be reset each time the block receives a new item. Answer 1: Yes, you can accomplish this rather elegantly by chaining blocks together. In this case you want to set up a TransformBlock which you link "before" the BatchBlock. That would look something like this: Timer…
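A sketch of that chained-block approach, assuming an arbitrary batch size of 10 and a 5-second sliding timeout:

    using System;
    using System.Threading;
    using System.Threading.Tasks.Dataflow;

    class TimedBatching
    {
        static void Main()
        {
            var batchBlock = new BatchBlock<int>(batchSize: 10);

            // Fires only if no item has arrived for 5 seconds,
            // flushing whatever is queued even if the batch is not full.
            var timer = new Timer(_ => batchBlock.TriggerBatch());

            // Pass-through block that resets the timeout on every item.
            var timeoutBlock = new TransformBlock<int, int>(item =>
            {
                timer.Change(TimeSpan.FromSeconds(5), Timeout.InfiniteTimeSpan);
                return item;
            });
            timeoutBlock.LinkTo(batchBlock, new DataflowLinkOptions { PropagateCompletion = true });

            var printBlock = new ActionBlock<int[]>(batch =>
                Console.WriteLine($"Batch of {batch.Length}"));
            batchBlock.LinkTo(printBlock);

            // Items go to timeoutBlock, not to batchBlock directly.
            for (int i = 0; i < 3; i++) timeoutBlock.Post(i);
            Thread.Sleep(TimeSpan.FromSeconds(6)); // let the timeout flush the partial batch
        }
    }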

Apache Beam: refreshing a side input read from MongoDB using MongoDbIO.read(), Part 2

回眸只為那壹抹淺笑 Submitted on 2020-06-17 15:57:29
Question: I am not sure how GenerateSequence works for my case, since I have to read values from Mongo periodically on an hourly or daily basis. I created a ParDo that reads from MongoDB and also added a window into GlobalWindows with a trigger (which I will update as per requirements). But the code snippet below gives a return-type error, so could you please help me correct these lines of code? Also see the attached snapshot of the error. And how does GenerateSequence help in my case? PCollectionView<List<String>> list_of…
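The return-type error typically stems from mismatched DoFn type parameters: GenerateSequence produces Longs, so the DoFn must consume Long, and its output type must match the view. A sketch assuming the goal is a PCollectionView<List<String>> refreshed hourly, with fetchMongoDocuments() as a hypothetical Mongo reader (imports as in the previous Beam sketch):

    PCollectionView<List<String>> listOfDocs = p
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardHours(1)))
        .apply(ParDo.of(new DoFn<Long, String>() { // DoFn<Long, ...>: the input is the sequence tick
          @ProcessElement
          public void process(ProcessContext c) {
            for (String doc : fetchMongoDocuments()) { // hypothetical Mongo reader
              c.output(doc);
            }
          }
        }))
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(View.asList()); // a List view matches PCollectionView<List<String>>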

Avoiding use of ActionBlock<TInput>.Post when DataflowBlockOptions.BoundedCapacity is not the default value?

十年热恋 Submitted on 2020-05-30 08:14:09
Question: I've heard that you can lose information if you use the Post method instead of the SendAsync method of an ActionBlock<T> object when you decide to utilize its BoundedCapacity property. Could someone please explain why that is so? Answer 1: The Post method attempts to post an item synchronously and returns true or false, depending on whether the block…
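A small demonstration of the difference (BoundedCapacity = 1 is an arbitrary choice for the sketch):

    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class PostVsSendAsync
    {
        static async Task Main()
        {
            var block = new ActionBlock<int>(async n => await Task.Delay(1000),
                new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

            block.Post(1); // accepted: the block has room

            // The block is now full. Post returns false immediately;
            // if the return value is ignored, item 2 is silently lost.
            bool accepted = block.Post(2);
            Console.WriteLine($"Post accepted: {accepted}"); // False

            // SendAsync returns a Task<bool> that completes only once the
            // block accepts (or declines) the item, so awaiting it applies
            // backpressure instead of dropping data.
            bool sent = await block.SendAsync(3);
            Console.WriteLine($"SendAsync accepted: {sent}"); // True

            block.Complete();
            await block.Completion;
        }
    }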

Array type in ClickHouseIO for Apache Beam (Dataflow)

江枫思渺然 Submitted on 2020-05-17 07:55:26
Question: I am using Apache Beam to consume JSON and insert it into ClickHouse. I am currently having a problem with the Array data type: everything works fine until I add an array-typed field Schema.Field.of("inputs.value", Schema.FieldType.array(Schema.FieldType.INT64).withNullable(true)). Code for the transformations: p.apply(transformNameSuffix + "ReadFromPubSub", PubsubIO.readStrings().fromSubscription(chainConfig.getPubSubSubscriptionPrefix() + "transactions").withIdAttribute(PUBSUB_ID_ATTRIBUTE))…
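One thing worth checking (an assumption, not a confirmed fix for this exact error): ClickHouse allows Array(Nullable(Int64)) but not Nullable(Array(...)), so the nullability may belong on the element type rather than on the array field. A sketch of the schema and a matching Row, with hypothetical field names:

    import java.util.Arrays;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    public class ClickHouseRows {
      public static void main(String[] args) {
        Schema schema = Schema.builder()
            .addStringField("hash") // hypothetical sibling field
            // Nullable element type => Array(Nullable(Int64)) on the ClickHouse side.
            .addField(Schema.Field.of("inputs.value",
                Schema.FieldType.array(Schema.FieldType.INT64.withNullable(true))))
            .build();

        Row row = Row.withSchema(schema)
            .addValue("0xabc")
            .addValue(Arrays.asList(1L, null, 3L)) // a null element is allowed
            .build();

        System.out.println(row);
      }
    }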

Dataflow job state/Scheduling and Options

你离开我真会死。 Submitted on 2020-05-17 07:04:28
Question: I am trying to understand Dataflow's staging and execution design. It seems like a primary use case is not supported, but perhaps I am lacking a general understanding of the intended design. My goal: I want to execute my Dataflow pipeline on a regular interval as a bounded/batch job. I have an optional time-range argument that allows me to run the same pipeline for a specific historical backfill or on an hourly basis. This argument is supposed to update the BigQuery SQL query in the pipeline…
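One way to express such a time-range argument so the same pipeline serves both backfills and hourly runs is a ValueProvider-backed option feeding the BigQuery query, which also keeps the pipeline usable as a template. A sketch with hypothetical option and table names:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.ValueProvider;
    import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;

    public class TimeRangePipeline {
      public interface Options extends PipelineOptions {
        @Description("Inclusive start of the time range to process")
        ValueProvider<String> getStartTime();
        void setStartTime(ValueProvider<String> value);
      }

      public static void main(String[] args) {
        Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadWindow", BigQueryIO.readTableRows()
            .fromQuery(NestedValueProvider.of(
                options.getStartTime(),
                start -> "SELECT * FROM `project.dataset.events` WHERE ts >= '" + start + "'"))
            .usingStandardSql()
            .withoutValidation()); // the query string is not known until runtime

        p.run();
      }
    }

Scheduling the interval itself lives outside the pipeline; a common arrangement is cron or Cloud Scheduler launching the job (or template) with the desired startTime for each run.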