google-cloud-dataflow

Calling beam.io.WriteToBigQuery in a beam.DoFn

Submitted by 魔方 西西 on 2021-01-28 19:11:33
Question: I've created a Dataflow template with some parameters. When I write the data to BigQuery, I would like to use these parameters to determine which table it should write to. I've tried calling WriteToBigQuery in a ParDo as suggested in the linked question "How can I write to Big Query using a runtime value provider in Apache Beam?". The pipeline ran successfully, but it is not creating or loading any data into BigQuery. Any idea what the issue might be? def run(): pipeline_options =
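A minimal sketch of the alternative that avoids calling the sink inside a DoFn: WriteToBigQuery's table argument can take the runtime ValueProvider directly, so the template parameter (here assumed to be named output_table, with a one-column placeholder schema) is resolved when the template is executed rather than at graph-construction time.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime parameter that picks the destination table.
        parser.add_value_provider_argument(
            '--output_table', help='Destination as PROJECT:DATASET.TABLE')

def run():
    options = PipelineOptions()
    template_options = options.view_as(TemplateOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create([{'id': 1}])
         | 'Write' >> beam.io.WriteToBigQuery(
             table=template_options.output_table,  # resolved at template run time
             schema='id:INTEGER',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))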

Beam streaming pipeline does not write files to bucket

Submitted by 我们两清 on 2021-01-28 18:18:52
Question: I have a Python streaming pipeline on GCP Dataflow that reads thousands of messages from Pub/Sub, like this: with beam.Pipeline(options=pipeline_options) as p: lines = p | "read" >> ReadFromPubSub(topic=str(job_options.inputTopic)) lines = lines | "decode" >> beam.Map(decode_message) lines = lines | "Parse" >> beam.Map(parse_json) lines = lines | beam.WindowInto(beam.window.FixedWindows(1*60)) lines = lines | "Add device id key" >> beam.Map(lambda elem: (elem.get('id'), elem)) lines = lines
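For comparison, a minimal sketch of a windowed streaming write to GCS using fileio.WriteToFiles, which supports unbounded input when the collection is windowed; the topic and bucket path are placeholders and the question's parsing steps are reduced to a simple decode.

import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'read' >> ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | 'decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'window' >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     | 'write' >> fileio.WriteToFiles(
         path='gs://my-bucket/output/',
         sink=lambda dest: fileio.TextSink()))  # one text file per window/shard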

triggering_frequency can only be used with FILE_LOADS method of writing to BigQuery

Submitted by 点点圈 on 2021-01-28 18:14:03
Question: Unable to set triggering_frequency for a Dataflow streaming job. transformed | 'Write' >> beam.io.WriteToBigQuery( known_args.target_table, schema=schema, create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND, method=bigquery.WriteToBigQuery.Method.FILE_LOADS, triggering_frequency=5 ) Error: triggering_frequency can only be used with FILE_LOADS method of writing to BigQuery Answer 1: This is a bug. The WriteToBigQuery transform
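For reference, a sketch of the documented combination (the transform's own Method enum plus a triggering frequency in seconds); the table, schema and frequency value are placeholders. The error in the question was reported as a bug in the transform's argument validation rather than in this usage.

import apache_beam as beam

def write_to_bq(transformed, target_table, schema):
    return transformed | 'Write' >> beam.io.WriteToBigQuery(
        target_table,
        schema=schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        # seconds between BigQuery load jobs when the input is unbounded
        triggering_frequency=300)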

How to handle operations on local files over multiple ParDo transforms in Apache Beam / Google Cloud DataFlow

Submitted by 耗尽温柔 on 2021-01-28 18:01:01
Question: I am developing an ETL pipeline for Google Cloud Dataflow in which several branching ParDo transforms each require a local audio file. The branched results are then combined and exported as text. This was originally a Python script that ran on a single machine, which I am attempting to adapt for VM-worker parallelisation using GC Dataflow. The extraction process downloads the files from a single GCS bucket location and then deletes them after the transform is completed to keep storage
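One common pattern for this situation, sketched under the assumption that each branch can work from its own worker-local temporary copy (Dataflow workers do not share a filesystem): copy the GCS object to a temp file inside the DoFn, process it, and delete it in a finally block. process_audio() and the .wav suffix are placeholders.

import os
import tempfile
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class TranscodeBranch(beam.DoFn):
    def process(self, gcs_path):
        # Copy the GCS object to a worker-local temp file for libraries
        # that need a real file path.
        with FileSystems.open(gcs_path) as src, \
             tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as dst:
            dst.write(src.read())
            local_path = dst.name
        try:
            yield process_audio(local_path)  # placeholder for the branch's work
        finally:
            os.remove(local_path)  # keep worker disk usage low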

Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?

Submitted by 泪湿孤枕 on 2021-01-28 08:50:19
Question: Background: We have a pipeline that starts by receiving messages from PubSub, each containing the name of a file. These files are exploded to line level, parsed into JSON object nodes, and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to TableRows and written to BigQuery. It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a
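A minimal sketch of the usual fusion-breaking alternative to a hand-rolled GroupByKey: a Reshuffle after parsing lets the runner checkpoint (and acknowledge) the PubSub messages before the slow external call. parse_json and DecodeViaService are placeholders standing in for the question's own steps.

import apache_beam as beam

def decode_stage(lines):
    return (lines
            | 'Parse' >> beam.Map(parse_json)              # placeholder parser
            | 'BreakFusion' >> beam.Reshuffle()            # checkpoint before the slow stage
            | 'Decode' >> beam.ParDo(DecodeViaService()))  # placeholder slow external call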

How to add de-duplication to a streaming pipeline [apache-beam]

Submitted by 十年热恋 on 2021-01-28 08:04:21
Question: I have a working streaming pipeline in Apache Beam [Python] that ingests data from Pub/Sub, performs enrichment in Dataflow, and passes it to BigQuery. Within the streaming window, I would like to ensure that messages are not duplicated (as Pub/Sub guarantees only at-least-once delivery). So I figured I'd just use the Distinct method from Beam, but as soon as I use it my pipeline breaks (it can't proceed any further, and no local prints are visible either). Here is my pipeline code:
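A minimal sketch of one way to de-duplicate inside a fixed window without Distinct (which hashes whole elements and so struggles with dict payloads): key each message by an assumed unique message_id field, group, and keep one element per key. The field name is an assumption about the message schema.

import apache_beam as beam
from apache_beam import window

def dedupe(messages):
    return (messages
            | 'Window' >> beam.WindowInto(window.FixedWindows(60))
            | 'KeyById' >> beam.Map(lambda m: (m['message_id'], m))   # assumed unique id field
            | 'GroupById' >> beam.GroupByKey()
            | 'TakeOne' >> beam.Map(lambda kv: next(iter(kv[1]))))    # keep one copy per id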

What is the best way to show reports with firebase?

Submitted by 穿精又带淫゛_ on 2021-01-28 07:54:06
Question: I am currently using Cloud Functions to do aggregation in Firebase, so that whenever a certain type of data entry occurs, I aggregate it accordingly and store the result for our reports. There are the following concerns with this approach: adding new reports would mean going over all the existing data, which could be expensive with the Firebase Realtime Database, and making any changes to existing reports is also non-trivial. I was considering a solution like Cloud Dataflow. However, one issue is that it is
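If Dataflow were used here, one option (sketched with placeholder paths and field names, and assuming the entries have been exported as newline-delimited JSON to GCS) is to rebuild reports in batch from an export rather than aggregating incrementally in Cloud Functions, which sidesteps the "replay all existing data" concern when a new report is added.

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/exports/entries.json')  # NDJSON export (assumed)
     | 'Parse' >> beam.Map(json.loads)
     | 'KeyByReport' >> beam.Map(lambda e: (e['report_type'], e['value']))    # placeholder fields
     | 'Aggregate' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.MapTuple(lambda report, total: '{},{}'.format(report, total))
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/reports/report'))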

How to ingest data from a GCS bucket via Dataflow as soon as a new file is put into it?

Submitted by 走远了吗. on 2021-01-28 07:37:16
Question: I have a use case where I need to ingest data from a Google Cloud Storage bucket via Dataflow as soon as it is made available in the form of a new file in the bucket. How do I trigger execution of the Dataflow job as soon as the new data (file) becomes available or is added to the storage bucket? Answer 1: If your pipelines are written in Java, then you can use Cloud Functions and Dataflow templating. I'm going to assume you're using the 1.x SDK (it's also possible with 2.x). Write your pipeline and
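A sketch of the same idea in Python rather than Java, assuming a 1st-gen Cloud Function bound to a google.storage.object.finalize trigger and a Dataflow template already staged in GCS; the project, region and paths are placeholders.

from googleapiclient.discovery import build

def trigger_dataflow(event, context):
    """Launches a staged Dataflow template for each new object in the bucket."""
    service = build('dataflow', 'v1b3', cache_discovery=False)
    file_path = 'gs://{}/{}'.format(event['bucket'], event['name'])
    service.projects().locations().templates().launch(
        projectId='my-project',
        location='us-central1',
        gcsPath='gs://my-bucket/templates/my-template',
        body={
            'jobName': 'ingest-new-file',
            'parameters': {'inputFile': file_path},
        },
    ).execute()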

Google Cloud DataFlow job throws alert after few hours

Submitted by 坚强是说给别人听的谎言 on 2021-01-28 05:43:41
Question: Running a Dataflow streaming job using the 2.11.0 release, I get the following authentication error after a few hours: File "streaming_twitter.py", line 188, in <lambda> File "streaming_twitter.py", line 102, in estimate File "streaming_twitter.py", line 84, in estimate_aiplatform File "streaming_twitter.py", line 42, in get_service File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper return wrapped(*args, **kwargs) File "/usr/local/lib/python2
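One frequently suggested direction (a sketch, not a confirmed fix for this job): build the AI Platform client inside DoFn.setup() on the worker, so credentials are obtained there rather than captured once at pipeline construction and left to expire. The model path and the use of the 'ml' discovery API are assumptions based on the estimate_aiplatform frame in the trace.

import apache_beam as beam
from googleapiclient.discovery import build

class EstimateFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance on the worker, after deserialization.
        self._service = build('ml', 'v1', cache_discovery=False)

    def process(self, element):
        name = 'projects/my-project/models/my-model'  # placeholder model path
        response = self._service.projects().predict(
            name=name, body={'instances': [element]}).execute()
        yield response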

Dataflow autoscale does not boost performance

Submitted by 空扰寡人 on 2021-01-28 03:24:12
Question: I'm building a Dataflow pipeline that reads from Pub/Sub and sends requests to a third-party API. The pipeline uses THROUGHPUT_BASED autoscaling. However, when I ran a load test against it, after it autoscaled to 4 workers to catch up with the backlog in Pub/Sub, the workload seemed to be spread evenly between workers, but overall throughput did not increase significantly. [Chart: number of unacknowledged messages in Pub/Sub; the peak is when traffic stopped coming in] [Chart: bytes sent from each
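A minimal sketch of one common mitigation when the bottleneck is a blocking third-party API rather than worker CPU (so adding workers barely helps): batch elements before each request so a single call carries many elements. The batch sizes and call_api() are placeholders.

import apache_beam as beam

def send_to_api(messages):
    return (messages
            | 'Batch' >> beam.BatchElements(min_batch_size=10, max_batch_size=100)
            | 'CallApi' >> beam.Map(call_api))  # placeholder: one HTTP request per batch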