apache-beam

Apache Beam 2.19.0 not running on Cloud Dataflow anymore due to "Could not find a version that satisfies the requirement setuptools>=40.8"

Submitted by 蓝咒 on 2021-02-05 08:08:55
Question: For a few days now, our Python Dataflow jobs have been failing on worker startup with: "ERROR: Could not find a version that satisfies the requirement setuptools>=40.8.0 (from versions: none)" ERROR: Command errored out with exit status 1: /usr/local/bin/python3 /usr/local/lib/python3.5/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-qz0ogm1p/overlay --no-warn-script-location --no-binary :none: --only-binary :none: --no-index --find-links /var/opt/google/dataflow - …
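A hedged sketch of one common mitigation, not a confirmed fix for this exact report: pin the build dependencies (setuptools, wheel) and ship them with the job via the standard requirements_file pipeline option, so the worker does not have to resolve them at startup. The project, bucket, and pinned versions below are assumptions.

    # requirements.txt (illustrative pins, not a verified fix):
    #   setuptools>=40.8.0
    #   wheel
    from apache_beam import Pipeline
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # assumed project id
        temp_location="gs://my-bucket/tmp",  # assumed bucket
        region="us-central1",
    )
    # Ship the pinned requirements to the workers.
    options.view_as(SetupOptions).requirements_file = "requirements.txt"

    with Pipeline(options=options) as p:
        pass  # pipeline definition goes here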

How can I write to Big Query using a runtime value provider in Apache Beam?

Submitted by 末鹿安然 on 2021-02-04 06:14:37
Question: EDIT: I got this to work using beam.io.WriteToBigQuery with the experimental sink option turned on. I actually had it on already, but my issue was that I was trying to "build" the full table reference from two variables (dataset + table) wrapped in str(). This took the whole value provider argument object as a string instead of calling its get() method to obtain just the value. Original post: I am trying to generate a Dataflow template to then call from a GCP Cloud Function. (For reference, my Dataflow job is …
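A minimal sketch of the approach the edit describes, assuming a template parameter named output_table and a toy schema: pass the ValueProvider itself to beam.io.WriteToBigQuery rather than building the table string with str() at pipeline-construction time.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class TemplateOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Runtime (template) parameter; the name is illustrative.
            parser.add_value_provider_argument(
                '--output_table', type=str,
                help='Table reference, e.g. project:dataset.table')


    options = PipelineOptions()
    template_opts = options.view_as(TemplateOptions)

    with beam.Pipeline(options=options) as p:
        rows = p | 'CreateRows' >> beam.Create([{'name': 'a', 'value': 1}])
        # Hand the ValueProvider straight to the sink; wrapping it in str() would
        # serialize the provider object instead of its runtime value.
        rows | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            table=template_opts.output_table,
            schema='name:STRING,value:INTEGER',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)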

Google Cloud Dataflow - From PubSub to Parquet

Submitted by 房东的猫 on 2021-01-29 17:46:17
Question: I'm trying to write Google Pub/Sub messages to Google Cloud Storage using Google Cloud Dataflow. The Pub/Sub messages arrive in JSON format, and the only operation I want to perform is a transformation from JSON to Parquet files. In the official documentation I found a template provided by Google that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage …
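A sketch of the shape such a pipeline could take, assuming a simple message layout, a fixed five-minute window, and illustrative topic and bucket names; it is not the Google-provided template, just a minimal Pub/Sub-to-Parquet pipeline using beam.io.WriteToParquet.

    import json
    import apache_beam as beam
    import pyarrow
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms.window import FixedWindows

    # Parquet schema for the incoming JSON payload; the fields are assumptions.
    parquet_schema = pyarrow.schema([
        ('user_id', pyarrow.string()),
        ('event_time', pyarrow.string()),
        ('value', pyarrow.float64()),
    ])

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
                topic='projects/my-project/topics/my-topic')      # assumed topic
            | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
            # A file sink needs bounded windows in a streaming pipeline.
            | 'Window' >> beam.WindowInto(FixedWindows(5 * 60))
            | 'WriteParquet' >> beam.io.WriteToParquet(
                file_path_prefix='gs://my-bucket/output/events',  # assumed bucket
                schema=parquet_schema,
                file_name_suffix='.parquet')
        )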

How does Apache Beam's CombineValues operate over elements when executing arithmetic operations

Submitted by a 夏天 on 2021-01-29 14:57:02
Question: This is a bit of a contrived example, but I have been exploring the docs for CombineValues and wish to understand what I'm seeing. If I combine values and perform some arithmetic operations on them (the goal is to calculate the percentage of each key present in a bounded stream), then I need to use AverageFn (as defined in Example 8 in the docs and provided in the source code example snippets). However, this (based on Example 5) does not work: with beam.Pipeline() as pipeline: counts = ( …
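For context, a sketch of the AverageFn pattern the post refers to, modeled on Example 8 in the docs (the input data here is made up): with CombineValues, the CombineFn is fed each key's iterable of values through add_input, which is why a full CombineFn behaves differently from a plain arithmetic lambda.

    import apache_beam as beam


    class AverageFn(beam.CombineFn):
        """Accumulates (sum, count) per key and emits the mean."""

        def create_accumulator(self):
            return (0.0, 0)

        def add_input(self, accumulator, element):
            total, count = accumulator
            return total + element, count + 1

        def merge_accumulators(self, accumulators):
            totals, counts = zip(*accumulators)
            return sum(totals), sum(counts)

        def extract_output(self, accumulator):
            total, count = accumulator
            return total / count if count else float('nan')


    with beam.Pipeline() as pipeline:
        (
            pipeline
            | 'Create' >> beam.Create([('a', [1, 2, 3]), ('b', [10, 20])])
            # CombineValues runs the CombineFn over each key's list of values.
            | 'Average' >> beam.CombineValues(AverageFn())
            | 'Print' >> beam.Map(print)
        )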

What do the “|” and “>>” mean in Apache Beam?

Submitted by 故事扮演 on 2021-01-29 13:22:41
Question: I'm trying to understand Apache Beam. I was following the programming guide, and in one example they say: "The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection." I was quite surprised, because I didn't see a ParDo operation at any point, so I started wondering whether the | was actually the ParDo. The code looks like this: import apache_beam as beam …
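A stripped-down illustration (not the guide's CoGroupByKey snippet): | is Beam's overloaded "apply this transform" operator, and 'Label' >> transform only attaches a name to the transform; the ParDo in the guide's example is hidden inside beam.Map, which wraps a DoFn.

    import apache_beam as beam


    class FormatResult(beam.DoFn):
        def process(self, element):
            key, value = element
            yield '%s: %s' % (key, value)


    with beam.Pipeline() as pipeline:
        (
            pipeline
            | 'Create' >> beam.Create([('a', 1), ('b', 2)])
            # `|` applies a transform; `'Label' >>` just names it.
            | 'Format' >> beam.ParDo(FormatResult())
            | beam.Map(print)   # `|` without `>>` applies an unnamed transform
        )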

Apache Beam Wait.on JdbcIO.write with unbounded PCollection issue

Submitted by 与世无争的帅哥 on 2021-01-29 13:14:05
Question: I am trying to use the scenario below with an unbounded PCollection data source (PubSub): https://issues.apache.org/jira/browse/BEAM-6732 I am able to write to DB1. DB2 has a Wait.on on DB1 (PCollection.withResults). But unfortunately DB2 is not getting updated. When I change the source to a bounded dummy PCollection, it works. Any input is appreciated. Answer 1: As I mentioned on Jira, are you using any windowing in your unbounded pipeline? The writing to the other database starts only after the …

Dataflow template that reads input and schema from GCS as runtime arguments

Submitted by 旧巷老猫 on 2021-01-29 13:11:49
Question: I am trying to create a custom Dataflow template that takes 3 runtime arguments: an input file and a schema file location on GCS, and a BigQuery data sink table. The input file seems to be read properly using the beam.io.textio.ReadFromText method. However, I need to feed in the schema file (instead of hard-coding it inside the template) by reading it from GCS as well. This schema also needs to be passed to beam.io.WriteToBigQuery. This is my first time working with Dataflow and I am struggling to …
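A sketch of one way to structure this, under the assumption that all three parameters are declared with add_value_provider_argument (the argument names, the CSV input format, and the JSON schema-file layout are illustrative): beam.io.ReadFromText accepts a ValueProvider directly, while the schema file can be opened lazily inside a DoFn, where calling get() on a runtime parameter is allowed.

    import json
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems
    from apache_beam.options.pipeline_options import PipelineOptions


    class TemplateOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # All three are runtime (template) parameters; names are illustrative.
            parser.add_value_provider_argument('--input', type=str)
            parser.add_value_provider_argument('--schema_path', type=str)
            parser.add_value_provider_argument('--output_table', type=str)


    class ParseWithRuntimeSchema(beam.DoFn):
        """Loads the schema file at execution time, when get() may be called."""

        def __init__(self, schema_path):
            self.schema_path = schema_path
            self.fields = None

        def process(self, line):
            if self.fields is None:
                with FileSystems.open(self.schema_path.get()) as f:
                    # Assumed layout: a JSON list of {"name": ..., "type": ...}.
                    self.fields = [fld['name'] for fld in json.loads(f.read())]
            yield dict(zip(self.fields, line.split(',')))


    options = PipelineOptions()
    opts = options.view_as(TemplateOptions)

    with beam.Pipeline(options=options) as p:
        rows = (
            p
            | 'Read' >> beam.io.ReadFromText(opts.input)   # takes a ValueProvider
            | 'Parse' >> beam.ParDo(ParseWithRuntimeSchema(opts.schema_path))
        )
        # rows can then go to beam.io.WriteToBigQuery(table=opts.output_table, ...);
        # supplying the BigQuery schema itself at runtime is the open question here.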

What is the equivalent data type for Numeric in org.apache.beam.sdk.schemas.Schema.FieldType?

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-29 11:25:08
Question: I am trying to write data into a BigQuery table using BeamSQL. To write the data we need the schema of that data. I used org.apache.beam.sdk.schemas to define the schema of the data collection, which has a Numeric column. I want to know the equivalent data type for Numeric in the org.apache.beam.sdk.schemas.Schema.FieldType class. Could someone please help me use the equivalent schema for the Numeric data type? Answer 1: BeamSQL's Decimal can represent BigQuery's NUMERIC. BeamSQL's …

beam.io.WriteToText adds a new line after each value - can it be removed?

Submitted by 假装没事ソ on 2021-01-29 10:23:39
Question: My pipeline looks similar to the following: a ParDo returning a list per processed line | beam.io.WriteToText. beam.io.WriteToText adds a new line after each list element. How can I remove this newline and have the values separated by commas, so that I can build a CSV file? Any help is very appreciated! Thanks, eilalan Answer 1: To remove the newline char, you can use this: beam.io.WriteToText(append_trailing_newlines=False) But for adding commas between your values, there's no out-of-the-box feature …
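A small sketch of the workaround the answer hints at, with made-up data and an assumed output path: join each list into a single comma-separated string before WriteToText, so every element lands as one CSV row; append_trailing_newlines=False is only needed if you also want to suppress the per-record newline.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | 'Create' >> beam.Create([['a', 1, 2.5], ['b', 3, 4.0]])
            # Turn each list into one comma-separated line (one CSV row per element).
            | 'ToCsvRow' >> beam.Map(lambda values: ','.join(str(v) for v in values))
            | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result',  # assumed path
                                             file_name_suffix='.csv')
        )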

How to load my pickled ML model from GCS in Dataflow/Apache Beam

Submitted by 爷,独闯天下 on 2021-01-29 10:07:07
Question: I've developed an Apache Beam pipeline locally where I run predictions on a sample file. Locally on my computer I can load the model like this: with open('gs://newbucket322/my_dumped_classifier.pkl', 'rb') as fid: gnb_loaded = cPickle.load(fid) but when running on Google Dataflow that obviously doesn't work. I tried changing the path to GS:// but that also obviously does not work. I also tried this code snippet (from here) that was used to load files: class ReadGcsBlobs(beam.DoFn): def …
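One approach that handles gs:// paths is Beam's FileSystems API; the sketch below reuses the bucket path from the post, assumes a scikit-learn-style classifier, and loads the pickled model once per worker in DoFn.setup() so it can be reused for predictions.

    import pickle
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems


    class PredictDoFn(beam.DoFn):
        """Loads a pickled model from GCS once per worker, then reuses it."""

        def __init__(self, model_path):
            self.model_path = model_path
            self.model = None

        def setup(self):
            # FileSystems understands gs:// paths, unlike the built-in open().
            with FileSystems.open(self.model_path) as f:
                self.model = pickle.load(f)

        def process(self, element):
            yield self.model.predict([element])


    # Usage sketch:
    # predictions = lines | 'Predict' >> beam.ParDo(
    #     PredictDoFn('gs://newbucket322/my_dumped_classifier.pkl'))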