google-cloud-dataflow

Dataflow template that reads input and schema from GCS as runtime arguments

Submitted by 旧巷老猫 on 2021-01-29 13:11:49
Question: I am trying to create a custom Dataflow template that takes three runtime arguments: an input file location in GCS, a schema file location in GCS, and a BigQuery destination table. The input file seems to be read properly using the beam.io.textio.ReadFromText method. However, I need to feed in the schema file (instead of hard-coding it inside the template) by reading it from GCS as well. This schema also needs to be passed to beam.io.WriteToBigQuery. This is my first time working with Dataflow and I am struggling to …
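A minimal sketch of one way to approach this (not from the original thread): read the JSON schema file from GCS with Beam's FileSystems API and pass the parsed schema to WriteToBigQuery. The bucket, file, and table names are placeholders; the dict schema form shown here works on recent SDK versions (older ones may need a TableSchema object or a "name:TYPE,..." string), and for a classic template with RuntimeValueProvider arguments the schema would have to be resolvable when the job actually runs, not only when the template is built.

# Sketch only: placeholder paths and table; assumes the schema file contains a
# JSON list of BigQuery field definitions, e.g. [{"name": "id", "type": "STRING"}, ...].
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions


def load_schema(schema_path):
    # Read the schema file from GCS and wrap it in the dict form
    # ({'fields': [...]}) that WriteToBigQuery accepts.
    with FileSystems.open(schema_path) as f:
        return {'fields': json.loads(f.read().decode('utf-8'))}


def run(argv=None):
    options = PipelineOptions(argv)
    schema = load_schema('gs://my-bucket/schemas/table_schema.json')
    field_names = [f['name'] for f in schema['fields']]

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/data.csv')
         # Map each CSV line onto the field names taken from the schema file.
         | 'ToDict' >> beam.Map(lambda line: dict(zip(field_names, line.split(','))))
         | 'Write' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema=schema,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()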

What is the equivalent data type for Numeric in org.apache.beam.sdk.schemas.Schema.FieldType

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-29 11:25:08
Question: I am trying to write data into a BigQuery table using BeamSQL. To write the data we need the schema of that data. We used org.apache.beam.sdk.schemas to define the schema of the data collection, which has a Numeric column. What is the equivalent data type for Numeric in the org.apache.beam.sdk.schemas.Schema.FieldType class? Could someone please help me use the equivalent schema for the Numeric data type? Answer 1: BeamSQL's DECIMAL can represent BigQuery's NUMERIC. BeamSQL's …

ClassCastException when reading nested list of records

Submitted by 五迷三道 on 2021-01-29 10:42:57
Question: I am reading a BigQuery table from Dataflow where one of the fields is a "record" and "repeated" field, so I expected the resulting data type in Java to be List<TableRow>. However, when I try to iterate over the list I get the following exception: java.lang.ClassCastException: java.util.LinkedHashMap cannot be cast to com.google.api.services.bigquery.model.TableRow. The table schema looks something like this: { "id": "my_id", "values": [ { "nested_record": "nested" } ] } The code to iterate …

beam.io.WriteToText adds a new line after each value - can it be removed?

Submitted by 假装没事ソ on 2021-01-29 10:23:39
Question: My pipeline looks similar to the following: a ParDo that returns a list per processed line | beam.io.WriteToText. beam.io.WriteToText adds a new line after each list element. How can I remove this newline and have the values separated by commas so I can build a CSV file? Any help is very appreciated! Thanks, eilalan Answer 1: To remove the newline char, you can use this: beam.io.WriteToText(append_trailing_newlines=False) But for adding commas between your values, there's no out-of-the-box feature …
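A minimal sketch of the usual workaround (not from the original answer): join each list into a single comma-separated string before WriteToText, so every element is written as one CSV row. The file prefix and sample data are placeholders; real CSV output with quoting and escaping would typically go through the csv module instead of a plain join.

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([['a', '1'], ['b', '2']])
     # Turn each list of fields into one comma-separated line.
     | 'ToCsvLine' >> beam.Map(lambda fields: ','.join(str(f) for f in fields))
     | 'Write' >> beam.io.WriteToText('output', file_name_suffix='.csv'))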

How to load my pickled ML model from GCS to Dataflow/Apache Beam

Submitted by 爷,独闯天下 on 2021-01-29 10:07:07
Question: I've developed an Apache Beam pipeline locally where I run predictions on a sample file. Locally on my computer I can load the model like this: with open('gs://newbucket322/my_dumped_classifier.pkl', 'rb') as fid: gnb_loaded = cPickle.load(fid) but when running on Google Dataflow that obviously doesn't work. I tried changing the path to GS:// but that also obviously does not work. I also tried this code snippet (from here) that was used to load files: class ReadGcsBlobs(beam.DoFn): def …
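A minimal sketch of one common approach (not from the original thread): open the pickle through Beam's FileSystems abstraction inside a DoFn's setup method, so the model is loaded once per worker and the same code works both locally and on Dataflow. The path and the predict call are placeholders, and DoFn.setup requires a reasonably recent Python SDK.

import pickle

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class PredictDoFn(beam.DoFn):
    def __init__(self, model_path):
        self._model_path = model_path
        self._model = None

    def setup(self):
        # Load the model once per worker, not once per element.
        with FileSystems.open(self._model_path) as f:
            self._model = pickle.load(f)

    def process(self, element):
        # Placeholder: adapt the feature extraction to the real input format.
        yield self._model.predict([element])[0]

It would then be used as records | 'Predict' >> beam.ParDo(PredictDoFn('gs://my-bucket/my_dumped_classifier.pkl')), with the bucket path adjusted to the real model location.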

How do I stage a GCP/Apache Beam Dataflow template?

Submitted by 夙愿已清 on 2021-01-29 09:00:27
Question: OK, I have to be missing something here. What do I need to stage a pipeline as a template? When I try to stage my template via these instructions, it runs the module but doesn't stage anything. It appears to function as expected without errors, but I don't see any files actually get added to the bucket location listed in my --template_location. Should my Python code be showing up there? I assume so, right? I have made sure I have all the Beam and Google Cloud SDKs installed, but maybe I'm …
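A minimal sketch of how classic template staging is usually wired up (not the asker's code; all option values are placeholders): when --template_location is set and the pipeline is run with the DataflowRunner, the runner writes the template file to that GCS path instead of launching a job. A common reason nothing shows up is that run() is never actually called on the pipeline.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--staging_location=gs://my-bucket/staging',
    '--template_location=gs://my-bucket/templates/my_template',
])

p = beam.Pipeline(options=options)
(p
 | 'Create' >> beam.Create(['placeholder'])
 | 'NoOp' >> beam.Map(lambda x: x))

# With --template_location set, this stages the template file in GCS
# rather than starting a Dataflow job.
p.run()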

What does “Error in SQL Launcher” mean when using the Dataflow SQL UI?

Submitted by 偶尔善良 on 2021-01-29 08:38:05
Question: I tried to create a Dataflow job using the Dataflow SQL UI. I followed the Using Dataflow SQL tutorial and the job ran properly. I then changed the data source to a BigQuery table. My plan is to query from the BigQuery table and save the result back to a BigQuery table. When I created the Dataflow job, I got the error message: Error in SQL Launcher. What does the error mean? Thanks for your help! Answer 1: Dataflow SQL does not yet support the DATE type (there was a bug in the SQL validator that didn't catch …

PubSub to Spanner Streaming Pipeline

Submitted by 若如初见. on 2021-01-29 08:31:36
Question: I am trying to stream PubSub messages of type JSON to a Spanner database, and insert_update works very well. The Spanner table has a composite primary key, so the existing data needs to be deleted before inserting new data from PubSub (so that only the latest data is present). Spanner replace or insert/update mutations do not work in this case. I added to the pipeline: import org.apache.beam.* ; public class PubSubToSpannerPipeline { // JSON to TableData Object public static class PubSubToTableDataFn extends DoFn …

Dataflow Python SDK Avro Source/Sink

Submitted by 有些话、适合烂在心里 on 2021-01-29 03:00:28
Question: I am looking to ingest and write Avro files in GCS with the Python SDK. Is this currently possible with Avro leveraging the Python SDK? If so, how would I do this? I see TODO comments in the source regarding this, so I am not too optimistic. Answer 1: You are correct: the Python SDK does not yet support this, but it will soon. Answer 2: As of version 2.6.0 of the Apache Beam/Dataflow Python SDK, it is indeed possible to read (and write) Avro files in GCS. Even better, the Python SDK for Beam now …
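A minimal sketch illustrating the second answer (bucket paths and the schema are placeholders): apache_beam.io.avroio provides ReadFromAvro and WriteToAvro. Depending on the SDK version, WriteToAvro expects either a fastavro-style dict schema (shown here) or a parsed avro.schema object.

import apache_beam as beam
from apache_beam.io import avroio

SCHEMA = {
    'namespace': 'example.avro',
    'type': 'record',
    'name': 'User',
    'fields': [{'name': 'name', 'type': 'string'}],
}

with beam.Pipeline() as p:
    # Read every Avro file matching the pattern, then write the records back out.
    records = p | 'ReadAvro' >> avroio.ReadFromAvro('gs://my-bucket/in/*.avro')
    (records
     | 'WriteAvro' >> avroio.WriteToAvro(
         'gs://my-bucket/out/users',
         schema=SCHEMA,
         file_name_suffix='.avro'))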

Writing failed row inserts in a streaming job to BigQuery using the Apache Beam Java SDK?

Submitted by 白昼怎懂夜的黑 on 2021-01-29 02:40:03
Question: While running a streaming job it is always good to have logs of the rows that were not processed while inserting into BigQuery. Catching those and writing them into another BigQuery table gives an idea of what went wrong. Below are the steps you can try to achieve this. Answer 1: Prerequisites: apache-beam >= 2.10.0 or latest. Using the getFailedInsertsWithErr() function available in the SDK you can easily catch the failed inserts and push them to another table for performing RCA. This becomes an …