Question
My current Beam pipeline reads files as a stream using FileIO.matchAll().continuously(), which returns a PCollection where each element is one file's metadata/ReadableFile. After some processing, I want to write these files back, with the same names, to another GCS bucket. Is there a sink I should use to write each PCollection element back to GCS, or is there some other way to do it?
Is it possible to create a window per element and then use some GCS sink IO to achieve this?
When operating on a window (even if it has multiple elements), does Beam guarantee that the window is either fully processed or not processed at all? In other words, are write operations to GCS or BigQuery for a given window atomic, rather than partial, in case of failures?
Answer 1:
Can you simply write a DoFn<ReadableFile, Void> that takes the file and copies it to the desired location using the FileSystems API? You don't need any "sink" to do that - and, in any case, this is what all "sinks" (TextIO.write(), AvroIO.write(), etc.) are under the hood anyway: they are simply Beam transforms made of ParDo's and GroupByKey's.
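A minimal sketch of that approach, assuming a recent Beam Java SDK: a DoFn that copies each matched file to a destination prefix while keeping its original file name. The class name CopyFileFn and the destination prefix "gs://my-dest-bucket/copied/" are placeholders for illustration, not part of the original answer.

```java
import java.util.Collections;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

/** Copies each matched file into a destination prefix, keeping its original name. */
public class CopyFileFn extends DoFn<FileIO.ReadableFile, Void> {

  private final String destPrefix; // e.g. "gs://my-dest-bucket/copied/" (placeholder)

  public CopyFileFn(String destPrefix) {
    this.destPrefix = destPrefix;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    ResourceId src = c.element().getMetadata().resourceId();
    // Resolve <destPrefix>/<original file name> as the copy target.
    ResourceId dest =
        FileSystems.matchNewResource(destPrefix, /* isDirectory= */ true)
            .resolve(src.getFilename(), StandardResolveOptions.RESOLVE_FILE);
    // Same scheme (gs://), different bucket: a plain copy via the FileSystems API.
    FileSystems.copy(Collections.singletonList(src), Collections.singletonList(dest));
  }
}
```

It would be applied after the matching/reading step, for example: matches.apply(FileIO.readMatches()).apply(ParDo.of(new CopyFileFn("gs://my-dest-bucket/copied/"))).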
Source: https://stackoverflow.com/questions/48237870/streaming-write-to-gcs-using-apache-beam-per-element