google-cloud-dataflow

Is it possible to have both Pub/Sub and BigQuery as inputs in Google Dataflow?

最后都变了 - submitted on 2021-01-28 03:02:39
Question: In my project I want to use a streaming pipeline in Google Dataflow to process Pub/Sub messages. While cleaning the input data I would also like a side input from BigQuery, but this presents a problem that leaves one of the two inputs not working. I have set streaming=True in my pipeline options, which allows the Pub/Sub input to be processed properly, but BigQuery is not compatible with streaming pipelines (see link below): https://cloud.google.com/dataflow/docs
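
For illustration, a minimal Java sketch of the pattern the question describes: a bounded BigQuery read used as a side input to a streaming Pub/Sub pipeline. The table, subscription, and class names are placeholders, and whether a given runner accepts this combination in streaming mode is precisely what is being asked.

import java.util.List;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollectionView;

public class PubsubWithBigQuerySideInput {
  public static void main(String[] args) {
    // Run with --streaming=true (the Java equivalent of streaming=True).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded BigQuery read, materialized once as a side input.
    final PCollectionView<List<TableRow>> lookup =
        p.apply("ReadLookup",
                BigQueryIO.readTableRows().from("my-project:my_dataset.lookup"))
         .apply("AsSideInput", View.asList());

    p.apply("ReadPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
     .apply("CleanWithLookup", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             List<TableRow> rows = c.sideInput(lookup); // the BigQuery rows
             // ... clean c.element() using rows ...
             c.output(c.element());
           }
         }).withSideInputs(lookup));

    p.run();
  }
}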

Dataflow autoscale does not boost performance

蹲街弑〆低调 - submitted on 2021-01-28 01:42:49
Question: I'm building a Dataflow pipeline that reads from Pub/Sub and sends requests to a third-party API. The pipeline uses THROUGHPUT_BASED autoscaling. However, during a load test it autoscaled to 4 workers to catch up with the backlog in Pub/Sub, and although the workload seems to have been spread evenly across the workers, overall throughput did not increase significantly. ^ Number of unacknowledged messages in Pub/Sub. The peak is when traffic stopped going in ^ Bytes sent from each
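
For reference, this is roughly how THROUGHPUT_BASED autoscaling is requested in the Beam Java SDK. The maxNumWorkers value is a placeholder; raising it only helps if the pipeline itself, rather than the third-party API, is the limiting factor.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Sketch: request throughput-based autoscaling with an explicit worker cap.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setStreaming(true);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
options.setMaxNumWorkers(10); // placeholder cap; the service scales within this bound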

Execute a process exactly after BigQueryIO.write() operation

社会主义新天地 - submitted on 2021-01-28 00:02:57
Question: I have a pipeline with a BigQuery table as its sink. I need to perform some steps exactly after the data has been written to BigQuery: running queries on that table, reading data from it, and writing to a different table. How can I achieve this? Should I create a separate pipeline for the latter? But then calling it after the first pipeline will be another problem, I assume. If none of the above works, is it possible to call another Dataflow job (template) from a running pipeline?
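
One approach the question alludes to (running a second pipeline after the first) is sketched below, under the assumption that both stages can live in the same main program; the table names and the query are placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TwoStageJob {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Stage 1: the existing pipeline that ends in BigQueryIO.write().
    Pipeline first = Pipeline.create(options);
    // ... build the transforms that write to the sink table ...
    first.run().waitUntilFinish(); // blocks until the BigQuery write has completed

    // Stage 2: only constructed and run once stage 1 has finished.
    Pipeline second = Pipeline.create(options);
    second.apply("ReadBack",
        BigQueryIO.readTableRows()
            .fromQuery("SELECT * FROM `my-project.my_dataset.sink_table`")
            .usingStandardSql());
    // ... transform and write to the other table ...
    second.run().waitUntilFinish();
  }
}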

How can I encode characters like emojis as UTF8 without unpaired surrogate characters?

為{幸葍}努か - submitted on 2021-01-27 17:23:54
Question: I have strings with a wide variety of characters that need to be written to Google BigQuery, which requires strictly valid UTF-8 strings. When trying to write strings containing a wide range of emoji, I get an error: java.lang.IllegalArgumentException: Unpaired surrogate at index 3373 at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLengthGeneral(Utf8.java:93) at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLength(Utf8.java:67) at org.apache.beam.sdk.coders
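
A small, SDK-agnostic sketch of one way to make such strings valid before they reach the coder: replace every lone surrogate with U+FFFD. The method name is arbitrary.

// Replaces any unpaired (lone) surrogate with U+FFFD so the result
// can be encoded as strict UTF-8; properly paired surrogates (e.g. emoji) are kept.
static String stripUnpairedSurrogates(String s) {
  StringBuilder out = new StringBuilder(s.length());
  int i = 0;
  while (i < s.length()) {
    int cp = s.codePointAt(i);
    if (Character.charCount(cp) == 1 && Character.isSurrogate((char) cp)) {
      out.append('\uFFFD'); // lone surrogate: not representable in UTF-8
    } else {
      out.appendCodePoint(cp); // normal char or a valid surrogate pair
    }
    i += Character.charCount(cp);
  }
  return out.toString();
}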

Processing stuck when writing to BigQuery

不打扰是莪最后的温柔 - submitted on 2021-01-27 09:54:35
Question: I'm using Cloud Dataflow to import data from Pub/Sub messages into BigQuery tables. I'm using DynamicDestinations since these messages can be written to different tables. I've recently noticed that the process started consuming all resources, and messages stating that the process is stuck started appearing: Processing stuck in step Write Avros to BigQuery Table/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 26h45m00s without outputting or completing in state finish at sun.misc
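
For context, a sketch of the kind of DynamicDestinations-based write the question describes. MyMessage, getTableName(), schemaFor(), toTableRow(), and the project and dataset names are hypothetical stand-ins, not taken from the question.

import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

messages.apply("Write Avros to BigQuery Table",
    BigQueryIO.<MyMessage>write()
        .to(new DynamicDestinations<MyMessage, String>() {
          @Override
          public String getDestination(ValueInSingleWindow<MyMessage> element) {
            return element.getValue().getTableName(); // route each message by a field
          }
          @Override
          public TableDestination getTable(String tableName) {
            return new TableDestination("my-project:my_dataset." + tableName, null);
          }
          @Override
          public TableSchema getSchema(String tableName) {
            return schemaFor(tableName); // hypothetical schema lookup
          }
        })
        .withFormatFunction(m -> toTableRow(m)) // hypothetical TableRow conversion
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));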

Reading BigQuery federated table as source in Dataflow throws an error

Deadly - submitted on 2021-01-27 07:09:58
Question: I have a federated source in BigQuery which points to some CSV files in GCS. When I try to read from the federated BigQuery table as a source for a Dataflow pipeline, it throws the following error: 1226 [main] ERROR com.google.cloud.dataflow.sdk.util.BigQueryTableRowIterator - Error reading from BigQuery table Federated_test_dataflow of dataset CPT_7414_PLAYGROUND : 400 Bad Request { "code" : 400, "errors" : [ { "domain" : "global", "message" : "Cannot list a table of type EXTERNAL.",
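
A sketch of the commonly suggested workaround: read the EXTERNAL table through a query, which avoids the direct table listing that the error complains about. Shown with the current Beam SDK's BigQueryIO (the question's stack trace is from the older Dataflow 1.x SDK); the project name is a placeholder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
p.apply("ReadFederatedTable",
    BigQueryIO.readTableRows()
        // A query works for EXTERNAL (federated) tables, unlike a direct table read.
        .fromQuery("SELECT * FROM `my-project.CPT_7414_PLAYGROUND.Federated_test_dataflow`")
        .usingStandardSql());
p.run();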

Sending credentials to Google Dataflow jobs

天涯浪子 - submitted on 2021-01-27 05:51:38
Question: What is the right way to pass credentials to Dataflow jobs? Some of my Dataflow jobs need credentials to make REST calls and to fetch/post processed data. I am currently using environment variables to pass the credentials to the JVM, reading them into a Serializable object and passing them on to the DoFn implementation's constructor. I am not sure this is the right approach, as a class that is Serializable should not contain sensitive information. Another way I thought of is to store the credential
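
One frequently suggested alternative to serializing credentials with the DoFn, sketched here as an assumption rather than something the question settles on: resolve the secret in @Setup on each worker, for example from Secret Manager. The project and secret names are placeholders.

import com.google.cloud.secretmanager.v1.SecretManagerServiceClient;
import com.google.cloud.secretmanager.v1.SecretVersionName;
import org.apache.beam.sdk.transforms.DoFn;

class CallExternalApiFn extends DoFn<String, String> {
  // Not serialized with the DoFn; fetched on each worker instead.
  private transient String apiKey;

  @Setup
  public void setup() throws Exception {
    try (SecretManagerServiceClient client = SecretManagerServiceClient.create()) {
      apiKey = client
          .accessSecretVersion(SecretVersionName.of("my-project", "api-credentials", "latest"))
          .getPayload().getData().toStringUtf8();
    }
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // use apiKey for the REST call, then emit the processed element
    c.output(c.element());
  }
}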

Can I make flex template jobs take less than 10 minutes before they start to process data?

拜拜、爱过 - submitted on 2021-01-24 09:38:32
Question: I am using the terraform resource google_dataflow_flex_template_job to deploy a Dataflow flex template job.

resource "google_dataflow_flex_template_job" "streaming_beam" {
  provider                = google-beta
  name                    = "streaming-beam"
  container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
  parameters = {
    "input_subscription"    = google_pubsub_subscription.ratings[0].id
    "output_table"          = "${var.project}:beam_samples.streaming_beam_sql"
    "service_account_email" = data.terraform