Question
In my project, I want to use a streaming pipeline in Google Dataflow to process Pub/Sub messages. While cleaning the input data, I also want a side input from BigQuery. This has presented a problem that causes one of the two inputs to fail.
I have set streaming=True in my pipeline options, which allows the Pub/Sub input to be processed properly. But BigQuery is not compatible with streaming pipelines (see the link below):
https://cloud.google.com/dataflow/docs/resources/faq#what_are_the_current_limitations_of_streaming_mode
I received this error: "ValueError: Cloud Pub/Sub is currently available for use only in streaming pipelines." This is understandable given those limitations.
But I only want to use BigQuery as a side input in order to map data to the incoming Pub/Sub data stream. It works fine locally, but once I try to run it on Dataflow, it returns the error above.
Has anyone found a good workaround for this?
EDIT: adding the framework of my pipeline below for reference:
# Imports assumed by this snippet
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.pvalue import AsList

# Set all options needed to properly run the pipeline
options = PipelineOptions(streaming=True,
                          runner='DataflowRunner',
                          project=project_id)

p = beam.Pipeline(options=options)

# Read the nickname lookup table from BigQuery; this is the source that
# conflicts with streaming mode on Dataflow
n_tbl_src = (p
             | 'Nickname Table Read' >> beam.io.Read(beam.io.BigQuerySource(
                 table=nickname_spec)))
# This is the main Dataflow pipeline. This will clean the incoming dataset for importing into BQ.
clean_vote = (p
              | beam.io.gcp.pubsub.ReadFromPubSub(
                  topic=None,
                  subscription='projects/{0}/subscriptions/{1}'.format(
                      project_id, subscription_name),
                  with_attributes=True)
              | 'Isolate Attributes' >> beam.ParDo(IsolateAttrFn())
              | 'Fix Value Types' >> beam.ParDo(FixTypesFn())
              | 'Scrub First Name' >> beam.ParDo(ScrubFnameFn())
              | 'Fix Nicknames' >> beam.ParDo(FixNicknameFn(), n_tbl=AsList(n_tbl_src))
              | 'Scrub Last Name' >> beam.ParDo(ScrubLnameFn()))
# The final dictionary will then be written to BigQuery for storage
(clean_vote | 'Write to BQ' >> beam.io.WriteToBigQuery(
    table=bq_spec,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
# Run the pipeline
p.run()
Answer 1:
@Pablo's comment above was the correct answer: instead of reading BigQuery with BigQuerySource inside the streaming pipeline, query the nickname table with the BigQuery client library while the pipeline is being constructed and pass the materialized rows in as the side input. For anyone working through the same situation, below is the change in my script that worked.
# Imports assumed by this snippet
import logging
import apache_beam as beam
from google.cloud import bigquery

# This opens the Beam pipeline to run on Dataflow
p = beam.Pipeline(options=options)
logging.info('Created Dataflow pipeline.')

# This will pull in all of the recorded nicknames to compare to the incoming PubSubMessages.
# The table is read once with the BigQuery client at pipeline-construction time and
# materialized into a plain Python list of dicts, so no BigQuerySource is needed.
client = bigquery.Client()
query_job = client.query("""
    select * from `{0}.{1}.{2}`""".format(project_id, dataset_id, nickname_table_id))
nickname_tbl = query_job.result()
nickname_tbl = [dict(row.items()) for row in nickname_tbl]
# This is the main Dataflow pipeline. This will clean the incoming dataset for importing into BQ.
clean_vote = (p
              | beam.io.gcp.pubsub.ReadFromPubSub(
                  topic=None,
                  subscription='projects/{0}/subscriptions/{1}'.format(
                      project_id, subscription_name),
                  with_attributes=True)
              | 'Isolate Attributes' >> beam.ParDo(IsolateAttrFn())
              | 'Fix Value Types' >> beam.ParDo(FixTypesFn())
              | 'Scrub First Name' >> beam.ParDo(ScrubFnameFn())
              | 'Fix Nicknames' >> beam.ParDo(FixNicknameFn(), n_tbl=nickname_tbl)
              | 'Scrub Last Name' >> beam.ParDo(ScrubLnameFn()))
# The final dictionary will then be written to BigQuery for storage
(clean_vote | 'Write to BQ' >> beam.io.WriteToBigQuery(
    table=bq_spec,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
# Run the pipeline
p.run()
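The DoFns referenced above are not shown in the post. As a rough illustration only, a FixNicknameFn that consumes the n_tbl side input might look something like the sketch below; the field names ('nickname', 'name', 'first_name') and the lookup logic are assumptions, not part of the original code.

import apache_beam as beam

class FixNicknameFn(beam.DoFn):
    # Hypothetical sketch. `n_tbl` is the side input passed to beam.ParDo above:
    # either AsList(n_tbl_src) in the original attempt, or the plain Python list
    # built with the BigQuery client in the accepted fix. Either way it arrives
    # here as an iterable of dicts.
    def process(self, element, n_tbl):
        # Build a nickname -> canonical-name lookup from the side input.
        lookup = {row['nickname'].lower(): row['name'] for row in n_tbl}
        first = element.get('first_name', '')
        # Replace the nickname with its canonical form if it is in the lookup.
        element['first_name'] = lookup.get(first.lower(), first)
        yield element

One consequence of the accepted fix is that the nickname list is read exactly once, when the pipeline is constructed, so rows added to the BigQuery table after the streaming job starts will not be picked up without redeploying the pipeline.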
Source: https://stackoverflow.com/questions/54130474/is-it-possible-to-have-both-pub-sub-and-bigquery-as-inputs-in-google-dataflow