apache-beam

Beam: Failed to serialize and deserialize property 'awsCredentialsProvider'

Submitted by 二次信任 on 2021-01-28 06:37:11
Question: I have been using the Beam pipeline examples as a guide in an attempt to load files from S3 for my pipeline. As in the examples, I have defined my own PipelineOptions that also extends S3Options, and I am attempting to use the DefaultAWSCredentialsProviderChain. The code to configure this is:

    MyPipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(MyPipelineOptions.class);
    options.setAwsCredentialsProvider(new DefaultAWSCredentialsProviderChain());
    options.setAwsRegion("us-east-1");
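One detail worth noting: Beam serializes PipelineOptions to JSON at job submission, and the awsCredentialsProvider property is handled by the Jackson AwsModule from the beam-sdks-java-io-amazon-web-services artifact, so that module must be on the classpath and the provider must be a type it supports (DefaultAWSCredentialsProviderChain is one of them). As a hedged sketch, assuming the standard AwsModule "@type" convention, the provider can also be supplied on the command line instead of in code:

    --awsCredentialsProvider={"@type":"DefaultAWSCredentialsProviderChain"} --awsRegion=us-east-1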

Google Cloud Dataflow job throws alert after a few hours

Submitted by 坚强是说给别人听的谎言 on 2021-01-28 05:43:41
Question: Running a Dataflow streaming job using the 2.11.0 release, I get the following authentication error after a few hours:

    File "streaming_twitter.py", line 188, in <lambda>
    File "streaming_twitter.py", line 102, in estimate
    File "streaming_twitter.py", line 84, in estimate_aiplatform
    File "streaming_twitter.py", line 42, in get_service
    File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
      return wrapped(*args, **kwargs)
    File "/usr/local/lib/python2

Dataflow autoscale does not boost performance

Submitted by 空扰寡人 on 2021-01-28 03:24:12
Question: I'm building a Dataflow pipeline that reads from Pub/Sub and sends requests to a 3rd-party API. The pipeline uses THROUGHPUT_BASED autoscaling. However, when I ran a load test against it, it autoscaled to 4 workers to catch up with the backlog in Pub/Sub, and the workload seemed to be spread out evenly between workers, but overall throughput did not increase significantly.

[Chart: number of unacknowledged messages in Pub/Sub; the peak is when traffic stopped coming in.]
[Chart: bytes sent from each ...]
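One common cause to check, offered as an assumption rather than a diagnosis from the post, is fusion: if the Pub/Sub read is fused with the DoFn that calls the external API, newly added workers cannot take over work already bound to the fused stage. A minimal Java sketch of breaking fusion with Reshuffle (the subscription variable and CallApiFn are hypothetical):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    Pipeline p = Pipeline.create(options);
    p.apply(PubsubIO.readStrings().fromSubscription(subscription))
     // Reshuffle.viaRandomKey() materializes elements and redistributes
     // them, preventing the read from being fused with the API call below.
     .apply(Reshuffle.viaRandomKey())
     .apply(ParDo.of(new CallApiFn()));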

Execute a process exactly after BigQueryIO.write() operation

Submitted by 社会主义新天地 on 2021-01-28 00:02:57
Question: I have a pipeline with a BigQuery table as the sink. I need to perform some steps only after the data has been written to BigQuery: querying that table, reading data from it, and writing to a different table. How can I achieve this? Should I create a separate pipeline for the latter steps? But then invoking it after the first pipeline finishes would be another problem, I assume. If none of the above works, is it possible to call another Dataflow job (template) from a running pipeline?
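One pattern worth sketching here, as an assumption rather than a confirmed answer: the Beam Java SDK's Wait.on holds back one PCollection until another (the signal) is complete, and for streaming inserts the WriteResult returned by BigQueryIO.write() exposes getFailedInserts(), which can act as that signal (QueryAndCopyFn is hypothetical):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Wait;

    WriteResult result = rows.apply(
        BigQueryIO.writeTableRows()
            .to("project:dataset.table")
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    // The downstream steps run only once the signal collection's windows
    // close, i.e. after the corresponding writes have happened.
    rows.apply(Wait.on(result.getFailedInserts()))
        .apply(ParDo.of(new QueryAndCopyFn()));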

Thread Synchronization for DoFn in Apache Beam

Submitted by 别说谁变了你拦得住时间么 on 2021-01-27 20:06:49
Question: I am writing a DoFn whose instance variable elements (i.e., a shared resource) can be mutated in the @ProcessElement method:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    public class DemoDoFn extends DoFn<String, Void> {
        private final int batchSize;
        private transient List<String> elements;

        public DemoDoFn(int batchSize) {
            this.batchSize = batchSize;
        }

        @StartBundle
        public void startBundle() {
            elements = new ArrayList<>();
        }
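The excerpt is cut off above; a batching DoFn of this shape typically continues roughly as follows (an assumed completion, not the asker's code):

        @ProcessElement
        public void processElement(@Element String element) {
            elements.add(element);
            if (elements.size() >= batchSize) {
                flush();
            }
        }

        @FinishBundle
        public void finishBundle() {
            flush();
        }

        private void flush() {
            // Placeholder: send the accumulated batch somewhere, then reset.
            elements = new ArrayList<>();
        }
    }

On the synchronization question itself: the DoFn contract documents that a single instance is never invoked concurrently from multiple threads (in practice each worker thread gets its own deserialized instance), so per-instance state touched only from bundle callbacks needs no locking.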

How to render a pipeline graph in Beam?

Submitted by 六月ゝ 毕业季﹏ on 2021-01-27 18:53:44
Question: Using Apache Beam Python SDK version 2.9.0, is it possible to get a renderable pipeline graph representation, similar to Google's Dataflow UI, instead of running the pipeline? I have difficulty assembling complex pipelines, and I would be happy to see the assembled pipeline before trying to execute it with DirectRunner.

Answer 1: Have a look at this unit test. It should give you an example of how this works with the Python SDK. TextRenderer simply returns the dot representation in text format. There is also an

Apache Beam TextIO.ReadAll: how to emit a KV PCollection instead of a PCollection of Strings

Submitted by 荒凉一梦 on 2021-01-27 18:43:04
Question: The pipeline starts by reading from PubsubIO. Each Pub/Sub message is a GCS file path. I know that I can use ReadAll() to emit the lines from each path; however, that doesn't serve my purpose, because the information about which file a line came from is lost. What I need to emit is a KV<'Filepath', 'Lines inside files'>. The Pub/Sub messages will look like:

    Message1 -> gs://folder1/Topic1/topicfile1.gz
    Message2 -> gs://folder1/Topic2/topicfile2.gz

Assume that the file contents are as below:

topicfile1.gz
{
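A sketch of the usual workaround, assumed rather than taken from the post: replace TextIO.readAll() with FileIO.matchAll() plus FileIO.readMatches(), then read each matched file in a DoFn so its path can be attached to every line:

    import org.apache.beam.sdk.io.Compression;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;

    filePaths  // PCollection<String> of gs:// paths read from Pub/Sub
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches().withCompression(Compression.GZIP))
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                FileIO.ReadableFile file = c.element();
                String path = file.getMetadata().resourceId().toString();
                // Reads the whole decompressed file into memory (fine for
                // modest files), then emits each line keyed by its path.
                for (String line : file.readFullyAsUTF8String().split("\n")) {
                    c.output(KV.of(path, line));
                }
            }
        }));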

Apache Beam/Dataflow Reshuffle

Submitted by 这一生的挚爱 on 2021-01-22 04:28:42
Question: What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:

"A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey, in particular preventing fusion of the surrounding transforms, checkpointing and deduplication by id."

What is the benefit of preventing fusion of the surrounding transforms? I thought fusion was an optimization to avoid unnecessary steps. Actual
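A constructed illustration, not from the post, of why preventing fusion can help: when a tiny input fans out into many elements, a fully fused pipeline processes the entire fan-out on the worker that produced it, whereas a Reshuffle lets the runner checkpoint and rebalance the work (ExpensiveFn is hypothetical):

    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.TypeDescriptors;

    p.apply(Create.of("seed"))                          // one tiny element
     .apply(FlatMapElements.into(TypeDescriptors.strings())
         .via((String s) -> IntStream.range(0, 100000)  // fans out to 100k
                                     .mapToObj(i -> s + "-" + i)
                                     .collect(Collectors.toList())))
     // Without this, the fan-out and the expensive step are fused onto the
     // single worker that ran Create; Reshuffle redistributes the elements.
     .apply(Reshuffle.viaRandomKey())
     .apply(ParDo.of(new ExpensiveFn()));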
