apache-beam

How to use 'add_value_provider_argument' to initialise a runtime parameter?

亡梦爱人 submitted on 2021-01-04 09:05:02
Question: Take the official document 'Creating Templates' as an example: https://cloud.google.com/dataflow/docs/templates/creating-templates

    class WordcountOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Use add_value_provider_argument for arguments to be templatable
            # Use add_argument as usual for non-templatable arguments
            parser.add_value_provider_argument(
                '--input',
                default='gs://dataflow-samples/shakespeare/kinglear.txt',
                help='Path of the file to read from')
            parser
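For context, here is a minimal sketch of how such a templatable option is typically consumed; the option class mirrors the snippet above, while the pipeline wiring below it is an assumption added for illustration, not part of the question:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WordcountOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Templatable: the value is resolved when the template is run,
            # not when the pipeline graph is constructed.
            parser.add_value_provider_argument(
                '--input',
                default='gs://dataflow-samples/shakespeare/kinglear.txt',
                help='Path of the file to read from')

    options = PipelineOptions()
    wordcount_options = options.view_as(WordcountOptions)

    with beam.Pipeline(options=options) as p:
        # I/O transforms such as ReadFromText accept a ValueProvider directly;
        # call .get() on it only inside runtime code (e.g. DoFn.process),
        # never at pipeline-construction time.
        lines = p | 'Read' >> beam.io.ReadFromText(wordcount_options.input)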

(Apache Beam) Cannot increase executor memory - it is fixed at 1024M despite using multiple settings

谁说我不能喝 submitted on 2021-01-04 05:39:28
Question: I am running an Apache Beam workload on Spark. I initialized the workers with 32 GB of memory (the slaves run with -c 2 -m 32G). spark-submit sets driver memory to 30g and executor memory to 16g. However, the executors fail with java.lang.OutOfMemoryError: Java heap space. The master GUI indicates that memory per executor is 1024M. In addition, I see that all Java processes are launched with -Xmx1024m. This means spark-submit doesn't propagate its executor settings to the executors. Pipeline
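For reference, executor memory for a Beam job on the Spark runner is normally set on the spark-submit command line; the sketch below uses placeholder master, class, and jar names that are not taken from the question:

    spark-submit \
        --master spark://master-host:7077 \
        --driver-memory 30g \
        --executor-memory 16g \
        --class org.example.MyBeamPipeline \
        my-beam-pipeline.jar \
        --runner=SparkRunner

If the master UI still shows 1024M per executor, checking spark.executor.memory in the running application's Environment tab helps confirm whether the setting reached the SparkConf at all.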

Apache Beam Dataflow runner throwing setup error

吃可爱长大的小学妹 submitted on 2021-01-03 18:31:28
Question: We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we get the error below:

    A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

We could not find detailed worker-startup logs. We tried increasing the memory size, worker count, etc., but still get the same error. Here is the command we use:

    python run.py \
        --project=xyz \
        --runner=DataflowRunner \
        --staging
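This class of startup error is often tied to the package set shipped to the workers via --setup_file; as background, a minimal sketch of such a file, with a placeholder package name and dependency list:

    # setup.py -- passed to the pipeline as --setup_file=./setup.py
    import setuptools

    setuptools.setup(
        name='my-dataflow-pipeline',            # placeholder name
        version='0.0.1',
        install_requires=['apache-beam[gcp]'],  # plus whatever the DoFns import
        packages=setuptools.find_packages(),
    )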

Connect to a Google Cloud SQL Postgres instance from a Beam pipeline

妖精的绣舞 submitted on 2020-12-29 07:52:26
Question: I want to connect to a Google Cloud SQL Postgres instance from an Apache Beam pipeline running on Google Dataflow, and I want to do this using the Python SDK. I am not able to find proper documentation for this. In the Cloud SQL how-to guides I don't see any documentation for Dataflow: https://cloud.google.com/sql/docs/postgres/ Can someone provide a documentation link or a GitHub example?

Answer 1: You can use the relational_db.Write and relational_db.Read transforms from beam-nuggets as follows. First install beam-nuggets:
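Filling out the answer's pointer, here is a sketch of a read through beam-nuggets; the connection details are placeholders, and the parameter names follow the beam-nuggets README, so they should be checked against the installed version:

    # pip install beam-nuggets
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from beam_nuggets.io import relational_db

    source_config = relational_db.SourceConfiguration(
        drivername='postgresql+pg8000',  # pure-Python driver, no native libs needed on workers
        host='my-cloudsql-ip',           # placeholder: instance IP or Cloud SQL proxy address
        port=5432,
        username='postgres',
        password='password',
        database='mydb',
    )

    with beam.Pipeline(options=PipelineOptions()) as p:
        records = p | 'Read table' >> relational_db.Read(
            source_config=source_config,
            table_name='months',  # placeholder table
        )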

Apache Beam + Dataflow too slow for only 18k rows

萝らか妹 submitted on 2020-12-15 06:44:10
Question: We need to execute a heavy calculation on simple but numerous data. The input data are rows in a BigQuery table with two columns: ID (INTEGER) and DATA (STRING). The DATA values are of the form "1#2#3#4#..." with 36 values. The output data have the same form, but DATA is transformed by an algorithm; it is a one-for-one transformation. We have tried Apache Beam with Google Cloud Dataflow, but it does not work: there are errors as soon as several workers are instantiated. For our POC we use only 18k input
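For reference, the pipeline shape described amounts to something like the sketch below; the table names, schema string, and per-value algorithm are placeholders rather than details from the question:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def my_algorithm(value):
        return value  # placeholder for the real heavy calculation

    def transform_row(row):
        # One-for-one transform of the 36 '#'-separated values.
        new_values = [my_algorithm(v) for v in row['DATA'].split('#')]
        return {'ID': row['ID'], 'DATA': '#'.join(new_values)}

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'Read' >> beam.io.ReadFromBigQuery(table='project:dataset.input_table')
         | 'Transform' >> beam.Map(transform_row)
         | 'Write' >> beam.io.WriteToBigQuery(
               'project:dataset.output_table',
               schema='ID:INTEGER,DATA:STRING',
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))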

Using AutoValueSchema in Apache Beam PCollection gives `RuntimeException: Creator parameter arg0 Doesn't correspond to a schema field`

六月ゝ 毕业季﹏ submitted on 2020-12-13 04:46:24
Question: I am trying to have a PCollection of AutoValue-defined objects that I have created, and I've added the appropriate annotations to infer the schema via @DefaultSchema(AutoValueSchema.class), like so:

    @DefaultSchema(AutoValueSchema.class)
    @AutoValue
    public abstract class MyAutoClass {
        public abstract String getMyStr();
        public abstract Integer getMyInt();

        @SchemaCreate
        public static MyAutoClass create(String myStr, Integer myInt) {
            return new AutoValue_MyAutoClass(myStr, myInt);
        }
    }

I have a

Elasticsearch/Dataflow - connection timeout after ~60 concurrent connections

六眼飞鱼酱① submitted on 2020-12-13 03:15:57
Question: We host an Elasticsearch cluster on Elastic Cloud and call it from Dataflow (GCP). The job works fine in dev, but when we deploy to prod we see lots of connection timeouts on the client side.

    Traceback (most recent call last):
      File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
      File "apache_beam/runners/common.py", line 570, in apache_beam.runners.common.SimpleInvoker.invoke_process
      File "main.py", line 159, in process
      File "/usr/local/lib/python3.7
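A usual way to bound the number of concurrent connections a Beam job opens is to create one client per worker in DoFn.setup() instead of per element; a sketch, where the cluster URL, credentials, and index name are placeholders:

    import apache_beam as beam
    from elasticsearch import Elasticsearch  # official elasticsearch-py client

    class IndexDoc(beam.DoFn):
        def setup(self):
            # setup() runs once per DoFn instance, so the client and its
            # connection pool are reused across bundles instead of being
            # recreated for every element.
            self._client = Elasticsearch(
                ['https://my-cluster.es.io:9243'],  # placeholder endpoint
                http_auth=('user', 'password'),     # placeholder credentials
                timeout=30,
                max_retries=3,
                retry_on_timeout=True,
            )

        def process(self, doc):
            self._client.index(index='my-index', body=doc)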