How to create a Dataflow pipeline from Pub/Sub to GCS in Python

谎友^ 2020-12-20 18:21

I want to use Dataflow to move data from Pub/Sub to GCS. Basically, I want Dataflow to accumulate messages for a fixed amount of time (15 minutes, for example) and then write them to GCS as a file.

2 Answers
  • 2020-12-20 18:51

    I ran into this same error, and found a workaround, but not a fix:

    TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'test-file-out/Write/WriteImpl/WriteBundles']
    

The error occurs both locally with the DirectRunner and on Dataflow with the DataflowRunner.

    Reverting to apache-beam[gcp]==2.9.0 allows my pipeline to run as expected.
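For reference, the pin from the sentence above can be applied with pip (adjust for your own environment or requirements file):

```shell
pip install "apache-beam[gcp]==2.9.0"
```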

  • 2020-12-20 18:54

    I had so much trouble trying to figure out this error:

    TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'generatedPtransform-1090']
    

    There seems to be an issue with WriteToText after Beam 2.9.0 (I am using Beam 2.14.0 with Python 3.7):

    | "Output" >> beam.io.WriteToText("<GCS path or local path>"))
    

    What made it work for me was removing that step from the pipeline and adding a custom DoFn:

    class WriteToGCS(beam.DoFn):
        def __init__(self):
            self.outdir = "gs://<project>/<folder>/<file>"

        def process(self, element):
            # Import inside process so it resolves on Dataflow workers
            from apache_beam.io.filesystems import FileSystems
            writer = FileSystems.create(self.outdir + '.csv', 'text/plain')
            writer.write(element)  # element must be bytes; encode str first
            writer.close()


    and in the pipeline add:

    | 'Save file' >> beam.ParDo(WriteToGCS())
    