google-cloud-dataflow

Calling beam.io.WriteToBigQuery in a beam.DoFn

Submitted by 魔方 西西 on 2021-01-28 19:11:33
Question: I've created a Dataflow template with some parameters. When I write the data to BigQuery, I would like to use these parameters to determine which table it should write to. I've tried calling WriteToBigQuery in a ParDo as suggested in the linked question "How can I write to Big Query using a runtime value provider in Apache Beam?". The pipeline ran successfully, but it is not creating or loading any data into BigQuery. Any idea what the issue might be? def run(): pipeline_options =
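A minimal sketch of the alternative that avoids calling the sink inside a DoFn: WriteToBigQuery's table argument can take the runtime ValueProvider directly, so the template parameter (here assumed to be named output_table, with a one-column placeholder schema) is resolved when the template is executed rather than at graph-construction time.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime parameter that picks the destination table.
        parser.add_value_provider_argument(
            '--output_table', help='Destination as PROJECT:DATASET.TABLE')

def run():
    options = PipelineOptions()
    template_options = options.view_as(TemplateOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create([{'id': 1}])
         | 'Write' >> beam.io.WriteToBigQuery(
             table=template_options.output_table,  # resolved at template run time
             schema='id:INTEGER',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))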

Beam streaming pipeline does not write files to bucket

Submitted by 我们两清 on 2021-01-28 18:18:52
Question: I have a Python streaming pipeline on GCP Dataflow that reads thousands of messages from Pub/Sub, like this: with beam.Pipeline(options=pipeline_options) as p: lines = p | "read" >> ReadFromPubSub(topic=str(job_options.inputTopic)) lines = lines | "decode" >> beam.Map(decode_message) lines = lines | "Parse" >> beam.Map(parse_json) lines = lines | beam.WindowInto(beam.window.FixedWindows(1*60)) lines = lines | "Add device id key" >> beam.Map(lambda elem: (elem.get('id'), elem)) lines = lines
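For comparison, a minimal sketch of a windowed streaming write to GCS using fileio.WriteToFiles, which supports unbounded input when the collection is windowed; the topic and bucket path are placeholders and the question's parsing steps are reduced to a simple decode.

import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'read' >> ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | 'decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'window' >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     | 'write' >> fileio.WriteToFiles(
         path='gs://my-bucket/output/',
         sink=lambda dest: fileio.TextSink()))  # one text file per window/shard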

triggering_frequency can only be used with FILE_LOADS method of writing to BigQuery

Submitted by 点点圈 on 2021-01-28 18:14:03
Question: Unable to set triggering_frequency for a Dataflow streaming job. transformed | 'Write' >> beam.io.WriteToBigQuery( known_args.target_table, schema=schema, create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND, method=bigquery.WriteToBigQuery.Method.FILE_LOADS, triggering_frequency=5 ) Error: triggering_frequency can only be used with FILE_LOADS method of writing to BigQuery Answer 1: This is a bug. The WriteToBigQuery transform
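For reference, a sketch of the documented combination (the transform's own Method enum plus a triggering frequency in seconds); the table, schema and frequency value are placeholders. The error in the question was reported as a bug in the transform's argument validation rather than in this usage.

import apache_beam as beam

def write_to_bq(transformed, target_table, schema):
    return transformed | 'Write' >> beam.io.WriteToBigQuery(
        target_table,
        schema=schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        # seconds between BigQuery load jobs when the input is unbounded
        triggering_frequency=300)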

How to handle operations on local files over multiple ParDo transforms in Apache Beam / Google Cloud DataFlow

Submitted by 耗尽温柔 on 2021-01-28 18:01:01
Question: I am developing an ETL pipeline for Google Cloud Dataflow in which several branching ParDo transforms each require a local audio file. The branched results are then combined and exported as text. This was originally a Python script that ran on a single machine, which I am attempting to adapt for VM-worker parallelisation using GC Dataflow. The extraction process downloads the files from a single GCS bucket location and then deletes them after the transform is completed to keep storage
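One common pattern for this situation, sketched under the assumption that each branch can work from its own worker-local temporary copy (Dataflow workers do not share a filesystem): copy the GCS object to a temp file inside the DoFn, process it, and delete it in a finally block. process_audio() and the .wav suffix are placeholders.

import os
import tempfile
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class TranscodeBranch(beam.DoFn):
    def process(self, gcs_path):
        # Copy the GCS object to a worker-local temp file for libraries
        # that need a real file path.
        with FileSystems.open(gcs_path) as src, \
             tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as dst:
            dst.write(src.read())
            local_path = dst.name
        try:
            yield process_audio(local_path)  # placeholder for the branch's work
        finally:
            os.remove(local_path)  # keep worker disk usage low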

Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?

Submitted by 泪湿孤枕 on 2021-01-28 08:50:19
Question: Background: We have a pipeline that starts by receiving messages from PubSub, each containing the name of a file. These files are exploded to line level, parsed into JSON object nodes, and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to TableRows and written to BigQuery. It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a
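A minimal sketch of the usual fusion-breaking alternative to a hand-rolled GroupByKey: a Reshuffle after parsing lets the runner checkpoint (and acknowledge) the PubSub messages before the slow external call. parse_json and DecodeViaService are placeholders standing in for the question's own steps.

import apache_beam as beam

def decode_stage(lines):
    return (lines
            | 'Parse' >> beam.Map(parse_json)              # placeholder parser
            | 'BreakFusion' >> beam.Reshuffle()            # checkpoint before the slow stage
            | 'Decode' >> beam.ParDo(DecodeViaService()))  # placeholder slow external call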

How to add de-duplication to a streaming pipeline [apache-beam]

Submitted by 十年热恋 on 2021-01-28 08:04:21
Question: I have a working streaming pipeline in Apache Beam [Python] that ingests data from Pub/Sub, performs enrichment in Dataflow, and passes it to BigQuery. Within the streaming window, I would like to ensure that messages are not duplicated (as Pub/Sub guarantees only at-least-once delivery). So I figured I'd just use the Distinct method from Beam, but as soon as I use it my pipeline breaks (it can't proceed any further, and no local prints are visible either). Here is my pipeline code:
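A minimal sketch of one way to de-duplicate inside a fixed window without Distinct (which hashes whole elements and so struggles with dict payloads): key each message by an assumed unique message_id field, group, and keep one element per key. The field name is an assumption about the message schema.

import apache_beam as beam
from apache_beam import window

def dedupe(messages):
    return (messages
            | 'Window' >> beam.WindowInto(window.FixedWindows(60))
            | 'KeyById' >> beam.Map(lambda m: (m['message_id'], m))   # assumed unique id field
            | 'GroupById' >> beam.GroupByKey()
            | 'TakeOne' >> beam.Map(lambda kv: next(iter(kv[1]))))    # keep one copy per id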

What is the best way to show reports with firebase?

Submitted by 穿精又带淫゛_ on 2021-01-28 07:54:06
Question: I am currently using Cloud Functions to do aggregation in Firebase, so that whenever a certain type of data entry occurs, I aggregate it accordingly and store the result for our reports. There are the following concerns with this approach: adding new reports would mean going over all the existing data, which could be expensive with the Firebase Realtime Database, and making any changes to existing reports is also non-trivial. I was considering a solution like Cloud Dataflow. However, one issue is that it is
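If Dataflow were used here, one option (sketched with placeholder paths and field names, and assuming the entries have been exported as newline-delimited JSON to GCS) is to rebuild reports in batch from an export rather than aggregating incrementally in Cloud Functions, which sidesteps the "replay all existing data" concern when a new report is added.

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/exports/entries.json')  # NDJSON export (assumed)
     | 'Parse' >> beam.Map(json.loads)
     | 'KeyByReport' >> beam.Map(lambda e: (e['report_type'], e['value']))    # placeholder fields
     | 'Aggregate' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.MapTuple(lambda report, total: '{},{}'.format(report, total))
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/reports/report'))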

How to ingest data from a GCS bucket via Dataflow as soon as a new file is put into it?

Submitted by 走远了吗. on 2021-01-28 07:37:16
Question: I have a use case where I need to ingest data from a Google Cloud Storage bucket via Dataflow as soon as it is made available in the form of a new file in the bucket. How do I trigger execution of the Dataflow job as soon as the new data (file) becomes available or is added to the storage bucket? Answer 1: If your pipelines are written in Java, then you can use Cloud Functions and Dataflow templating. I'm going to assume you're using the 1.x SDK (it's also possible with 2.x). Write your pipeline and
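A sketch of the same idea in Python rather than Java, assuming a 1st-gen Cloud Function bound to a google.storage.object.finalize trigger and a Dataflow template already staged in GCS; the project, region and paths are placeholders.

from googleapiclient.discovery import build

def trigger_dataflow(event, context):
    """Launches a staged Dataflow template for each new object in the bucket."""
    service = build('dataflow', 'v1b3', cache_discovery=False)
    file_path = 'gs://{}/{}'.format(event['bucket'], event['name'])
    service.projects().locations().templates().launch(
        projectId='my-project',
        location='us-central1',
        gcsPath='gs://my-bucket/templates/my-template',
        body={
            'jobName': 'ingest-new-file',
            'parameters': {'inputFile': file_path},
        },
    ).execute()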

Google Cloud DataFlow job throws alert after few hours

Submitted by 坚强是说给别人听的谎言 on 2021-01-28 05:43:41
Question: Running a Dataflow streaming job using the 2.11.0 release, I get the following authentication error after a few hours: File "streaming_twitter.py", line 188, in <lambda> File "streaming_twitter.py", line 102, in estimate File "streaming_twitter.py", line 84, in estimate_aiplatform File "streaming_twitter.py", line 42, in get_service File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper return wrapped(*args, **kwargs) File "/usr/local/lib/python2
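One frequently suggested direction (a sketch, not a confirmed fix for this job): build the AI Platform client inside DoFn.setup() on the worker, so credentials are obtained there rather than captured once at pipeline construction and left to expire. The model path and the use of the 'ml' discovery API are assumptions based on the estimate_aiplatform frame in the trace.

import apache_beam as beam
from googleapiclient.discovery import build

class EstimateFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance on the worker, after deserialization.
        self._service = build('ml', 'v1', cache_discovery=False)

    def process(self, element):
        name = 'projects/my-project/models/my-model'  # placeholder model path
        response = self._service.projects().predict(
            name=name, body={'instances': [element]}).execute()
        yield response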

Dataflow autoscale does not boost performance

Submitted by 空扰寡人 on 2021-01-28 03:24:12
Question: I'm building a Dataflow pipeline that reads from Pub/Sub and sends requests to a third-party API. The pipeline uses THROUGHPUT_BASED autoscaling. However, when I ran a load test against it, after it autoscaled to 4 workers to catch up with the backlog in Pub/Sub, the workload seemed to be spread evenly between workers, but overall throughput did not increase significantly. [Chart: number of unacknowledged messages in Pub/Sub; the peak is when traffic stopped coming in] [Chart: bytes sent from each
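A minimal sketch of one common mitigation when the bottleneck is a blocking third-party API rather than worker CPU (so adding workers barely helps): batch elements before each request so a single call carries many elements. The batch sizes and call_api() are placeholders.

import apache_beam as beam

def send_to_api(messages):
    return (messages
            | 'Batch' >> beam.BatchElements(min_batch_size=10, max_batch_size=100)
            | 'CallApi' >> beam.Map(call_api))  # placeholder: one HTTP request per batch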