Question
I have a Python streaming pipeline on GCP Dataflow that reads thousands of messages from Pub/Sub, like this:
with beam.Pipeline(options=pipeline_options) as p:
    lines = p | "read" >> ReadFromPubSub(topic=str(job_options.inputTopic))
    lines = lines | "decode" >> beam.Map(decode_message)
    lines = lines | "Parse" >> beam.Map(parse_json)
    lines = lines | beam.WindowInto(beam.window.FixedWindows(1 * 60))
    lines = lines | "Add device id key" >> beam.Map(lambda elem: (elem.get('id'), elem))
    lines = lines | "Group by key" >> beam.GroupByKey()
    lines = lines | "Abandon key" >> beam.Map(flatten)
    lines | "WriteToAvro" >> beam.io.WriteToAvro(job_options.outputLocation, schema=schema, file_name_suffix='.avro', mime_type='application/x-avro')
The pipeline runs just fine, except it never produces any output. Any ideas why?
Answer 1:
It looks like there were a few problems with your code. First, there was some badly formatted data with regard to null/None (which you have already fixed) and ints/floats (called out in the comments). Finally, the WriteToAvro transform cannot write unbounded PCollections. There is a work-around in which you define a new sink and use it with the WriteToFiles transform, which is able to write unbounded PCollections.
Note that as of this writing (2020-06-18), this method does not work with the Apache Beam Python SDK <= 2.23. This is because the Python pickler cannot deserialize a pickled Avro schema (see BEAM-6522), which effectively forces the solution to use FastAvro instead. You can use Avro if you manually upgrade dill to >= 0.3.1.1 and Avro to >= 1.9.0, but be careful, as this is currently untested.
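If you do go the route of upgrading dill and Avro, one way to get the pinned versions onto the Dataflow workers is a requirements file passed through SetupOptions. The snippet below is only a minimal sketch of that approach; the requirements.txt file name, the package name (avro vs. avro-python3, depending on your Python version), and the exact versions are assumptions based on the minimums mentioned above.

# Hypothetical requirements.txt next to your pipeline code:
#   dill>=0.3.1.1
#   avro-python3>=1.9.0
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions(streaming=True)
# Ask the runner to install the pinned packages on each worker.
pipeline_options.view_as(SetupOptions).requirements_file = 'requirements.txt'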
With the caveats out of the way, here is the work-around:
from apache_beam.io.fileio import FileSink
from apache_beam.io.fileio import WriteToFiles
import fastavro


class AvroFileSink(FileSink):
    def __init__(self, schema, codec='deflate'):
        self._schema = schema
        self._codec = codec

    def open(self, fh):
        # This is called on every new bundle.
        self.writer = fastavro.write.Writer(fh, self._schema, self._codec)

    def write(self, record):
        # This is called on every element.
        self.writer.write(record)

    def flush(self):
        # This is called before the file is closed.
        self.writer.flush()
This new sink is used as follows:
import apache_beam as beam

# Replace the following with your schema.
schema = fastavro.schema.parse_schema({
    'name': 'row',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'a', 'type': 'int'},
    ],
})

# Create the sink. This will be used by the WriteToFiles transform to write
# individual elements to the Avro file.
sink = AvroFileSink(schema=schema)

with beam.Pipeline(...) as p:
    lines = p | beam.io.ReadFromPubSub(...)
    lines = ...
    # This is where your new sink gets used. The WriteToFiles transform takes
    # the sink and uses it to write to a directory defined by the path
    # argument.
    lines | WriteToFiles(path=job_options.outputLocation, sink=sink)
Source: https://stackoverflow.com/questions/62431944/beam-streaming-pipeline-does-not-write-files-to-bucket