google-cloud-dataflow

Writing nested schema to BigQuery from Dataflow (Python)

旧街凉风 submitted on 2021-02-07 07:08:21
Question: I have a Dataflow job that writes to BigQuery. It works well for a non-nested schema, but fails for a nested schema. Here is my Dataflow pipeline:

pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)
wordcount_options = pipeline_options.view_as(WordcountTemplatedOptions)
schema = 'url: STRING,' \
         'ua: STRING,' \
         'method: STRING,' \
         'man: RECORD,' \
         'man.ip: RECORD,' \
         'man.ip.cc: STRING,' \
         'man.ip.city: STRING,' \
         'man.ip.as: INTEGER,' \
         'man.ip.country: STRING,'
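For reference, a minimal sketch of how a nested schema is typically expressed for beam.io.WriteToBigQuery in the Python SDK: the comma-separated 'name: TYPE,...' string cannot describe RECORD sub-fields, so a structured schema (a dict or a TableSchema) is used instead. The table reference and the sample row below are placeholders, not taken from the question.

import apache_beam as beam

nested_schema = {
    'fields': [
        {'name': 'url', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'ua', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'method', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'man', 'type': 'RECORD', 'mode': 'NULLABLE', 'fields': [
            {'name': 'ip', 'type': 'RECORD', 'mode': 'NULLABLE', 'fields': [
                {'name': 'cc', 'type': 'STRING', 'mode': 'NULLABLE'},
                {'name': 'city', 'type': 'STRING', 'mode': 'NULLABLE'},
                {'name': 'as', 'type': 'INTEGER', 'mode': 'NULLABLE'},
                {'name': 'country', 'type': 'STRING', 'mode': 'NULLABLE'},
            ]},
        ]},
    ]
}

with beam.Pipeline() as p:
    (p
     | beam.Create([{'url': '/x', 'ua': 'curl', 'method': 'GET',
                     'man': {'ip': {'cc': 'US', 'city': 'NYC', 'as': 1, 'country': 'USA'}}}])
     | beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',   # hypothetical table reference
         schema=nested_schema,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))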

Side output in ParDo | Apache Beam Python SDK

允我心安 submitted on 2021-02-07 04:18:47
Question: As the documentation is only available for Java, I could not really understand what it means. It states: "While ParDo always produces a main output PCollection (as the return value from apply), you can also have your ParDo produce any number of additional output PCollections. If you choose to have multiple outputs, your ParDo will return all of the output PCollections (including the main output) bundled together. For example, in Java, the output PCollections are bundled in a type-safe
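In the Python SDK the same idea is expressed with with_outputs() and pvalue.TaggedOutput. A minimal sketch, assuming an arbitrary even/odd split (the tag names and data are made up):

import apache_beam as beam
from apache_beam import pvalue

class SplitEvenOdd(beam.DoFn):
    def process(self, element):
        if element % 2 == 0:
            yield element                               # goes to the main output
        else:
            yield pvalue.TaggedOutput('odd', element)   # goes to the 'odd' output

with beam.Pipeline() as p:
    results = (p
               | beam.Create([1, 2, 3, 4])
               | beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even'))
    evens = results.even     # main output, also reachable as results['even']
    odds = results.odd       # additional output
    evens | 'PrintEven' >> beam.Map(print)
    odds | 'PrintOdd' >> beam.Map(print)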

How to deal with “The template parameters are invalid” when launching a custom template using Cloud Dataflow REST API/Python

随声附和 submitted on 2021-02-05 11:45:28
Question: I have been using Dataprep to build a Dataflow template. Running it from https://console.cloud.google.com/dataflow/createjob works with no problems. It prompts for parameters (regional endpoint, input locations, output locations, custom location for temp files), and the metadata file basically hands me the answers. When I come to run the custom template from Python using the REST API, I am including the parameters like the below (lots of quote escaping):

BODY = {
    "jobName": "{jobname}".format(jobname
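A hedged sketch of launching a custom template through the Dataflow REST API (templates.launch) with the Google API Python client. The project, region, bucket paths and parameter names below are placeholders; the real parameter names must match the template's metadata file, and every value in "parameters" has to be a plain string.

from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')

request = dataflow.projects().locations().templates().launch(
    projectId='my-project',                              # placeholder
    location='europe-west1',                             # regional endpoint
    gcsPath='gs://my-bucket/templates/my_template',      # template file
    body={
        'jobName': 'my-dataprep-job',
        'parameters': {
            # flat map of string -> string; names come from the metadata file
            'inputLocations': '{"location1":"gs://my-bucket/input.csv"}',
            'outputLocations': '{"location1":"gs://my-bucket/output"}',
            'customGcsTempLocation': 'gs://my-bucket/temp',
        },
        'environment': {'tempLocation': 'gs://my-bucket/temp'},
    },
)
response = request.execute()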

apache beam.io.BigQuerySource use_standard_sql not working when running as dataflow runner

ぐ巨炮叔叔 submitted on 2021-02-05 11:17:05
Question: I have a Dataflow job that first reads from a BigQuery query (in standard SQL). It works perfectly with the direct runner. However, when I try to run this pipeline with the Dataflow runner, I encounter this error:

response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Thu, 24 Dec 2020 09:28:21 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff',
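A minimal sketch, not necessarily the fix for this exact error: in newer SDK versions beam.io.ReadFromBigQuery replaces BigQuerySource and takes use_standard_sql plus a GCS location used for the table export. The project, dataset, table and bucket names are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # e.g. --runner=DataflowRunner --project=... --region=...

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
         query='SELECT url, method FROM `my-project.my_dataset.my_table`',
         use_standard_sql=True,
         gcs_location='gs://my-bucket/bq_export_tmp')
     | 'Print' >> beam.Map(print))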

Additional Parameters from the Dataflow UI into a Beam Pipeline

跟風遠走 submitted on 2021-02-05 09:29:40
Question: I'm working on Dataflow and have already built my custom pipeline with the Python SDK. I would like to feed values from the Dataflow UI into my custom pipeline using the Additional Parameters, as described at https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#staticvalue So I changed add_argument to add_value_provider_argument, following the Google docs:

class CustomParams(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument( "
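A minimal sketch of the full pattern, assuming a single string parameter; the option name and help text are placeholders. Runtime parameters are ValueProviders, so their values are read with .get() inside the running pipeline rather than at template build time.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class CustomParams(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--my_param',
            type=str,
            help='A value supplied in the Dataflow UI when launching the template')

options = PipelineOptions()
custom = options.view_as(CustomParams)

class UseParam(beam.DoFn):
    def __init__(self, param):
        self._param = param          # keep the ValueProvider itself, not its value
    def process(self, element):
        yield '%s-%s' % (self._param.get(), element)   # .get() only at runtime

with beam.Pipeline(options=options) as p:
    p | beam.Create(['a', 'b']) | beam.ParDo(UseParam(custom.my_param)) | beam.Map(print)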

apache beam 2.19.0 not running on cloud dataflow anymore due to Could not find a version that satisfies the requirement setuptools>=40.8

蓝咒 submitted on 2021-02-05 08:08:55
Question: For a few days now, our Python Dataflow jobs have been failing on worker startup with this error:

"ERROR: Could not find a version that satisfies the requirement setuptools>=40.8.0 (from versions: none)"
ERROR: Command errored out with exit status 1: /usr/local/bin/python3 /usr/local/lib/python3.5/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-qz0ogm1p/overlay --no-warn-script-location --no-binary :none: --only-binary :none: --no-index --find-links /var/opt/google/dataflow -
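The traceback shows pip on the worker building a package in an isolated environment with --no-index, so it cannot download setuptools. As a hedged illustration only (an assumed mitigation, not a confirmed fix for this report), one common workaround is to stage dependencies as pre-built wheels via --extra_package instead of letting the worker build them from source. All names below are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Stage a pre-built wheel so the worker does not have to build it from an sdist
# (which is what triggers pip's isolated build step shown in the error above).
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                                   # placeholder
    '--region=europe-west1',                                  # placeholder
    '--temp_location=gs://my-bucket/tmp',                     # placeholder
    '--extra_package=./dist/my_dependency-1.0.0-py3-none-any.whl',
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['ping']) | beam.Map(print)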

How can I write to Big Query using a runtime value provider in Apache Beam?

末鹿安然 submitted on 2021-02-04 06:14:37
Question: EDIT: I got this to work using beam.io.WriteToBigQuery with the sink experimental option turned on. I actually had it on already; my issue was that I was trying to "build" the full table reference from two variables (dataset + table) wrapped in str(). This took the ValueProvider arguments' whole representation as a string instead of calling the get() method to obtain just the value. OP: I am trying to generate a Dataflow template to then call from a GCP Cloud Function. (For reference, my Dataflow job is
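A minimal sketch of the pattern described in the EDIT: hand the runtime ValueProvider to WriteToBigQuery directly instead of formatting it into a string at template-build time. The option name, schema and sample data are placeholders; any experiments flag would be passed on the command line.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--output_table', type=str,
            help='Full table reference, e.g. project:dataset.table')

options = PipelineOptions()
my_options = options.view_as(MyOptions)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([{'name': 'example'}])
     | beam.io.WriteToBigQuery(
         my_options.output_table,          # ValueProvider, resolved at runtime
         schema='name:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))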

Google Cloud Dataflow - From PubSub to Parquet

房东的猫 submitted on 2021-01-29 17:46:17
Question: I'm trying to write Google Pub/Sub messages to Google Cloud Storage using Google Cloud Dataflow. The Pub/Sub messages arrive in JSON format, and the only operation I want to perform is a transformation from JSON to Parquet files. In the official documentation I found a template provided by Google that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage
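A minimal sketch of one way to do this, assuming JSON messages with two string fields; the topic, bucket, field names and window size are placeholders. fileio.WriteToFiles is used here because it accepts unbounded (streaming) input, with a small custom FileSink that buffers each shard and writes it out with pyarrow.

import json
import apache_beam as beam
import pyarrow
import pyarrow.parquet as pq
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

SCHEMA = pyarrow.schema([('url', pyarrow.string()), ('method', pyarrow.string())])

class ParquetSink(fileio.FileSink):
    """Buffers the records of one shard and writes them out as a Parquet file."""
    def open(self, fh):
        self._fh = fh
        self._rows = []
    def write(self, record):
        self._rows.append(record)
    def flush(self):
        table = pyarrow.Table.from_pydict(
            {name: [row.get(name) for row in self._rows] for name in SCHEMA.names},
            schema=SCHEMA)
        pq.write_table(table, self._fh)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | beam.Map(json.loads)
     | beam.WindowInto(window.FixedWindows(300))   # one output file per 5-minute window
     | fileio.WriteToFiles(
         path='gs://my-bucket/output/',
         sink=ParquetSink(),
         file_naming=fileio.default_file_naming('events', '.parquet')))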

Apache Beam Wait.on JdbcIO.write with unbounded PCollection issue

与世无争的帅哥 submitted on 2021-01-29 13:14:05
Question: I am trying to use the scenario below with an unbounded PCollection source (Pub/Sub): https://issues.apache.org/jira/browse/BEAM-6732 I am able to write to DB1. The DB2 write has a Wait.on the DB1 write (using the PCollection from .withResults()). But unfortunately DB2 is not getting updated. When I change the source to a bounded dummy PCollection, it works. Any input is appreciated. Answer 1: As I mentioned on Jira, are you using any windowing in your unbounded pipeline? The writing to another database starts only after the