Question
The pipeline starts by reading from PubsubIO. Each Pub/Sub message contains a GCS file path. I know that I can use TextIO.readAll() to emit the lines from each path. However, that doesn't serve my purpose, because the information about the file path is lost. What I need is to emit a KV<'Filepath','Lines inside files'>.
The Pub/Sub messages will look like this:
Message1 -> gs://folder1/Topic1/topicfile1.gz
Message2 -> gs://folder1/Topic2/topicfile2.gz
Assume that the file contents are as follows:
topicfile1.gz
{
topic1.line1
topic1.line2
}
topicfile2.gz
{
topic2.line1
topic2.line2
}
What I am expecting is a PCollection like the one below:
{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line1'>}
{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line2'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line1'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line2'>}
I couldn't find a way to read a file from a path inside a ParDo function so that I can map the path to the lines.
I hope this is clear.
Answer 1:
If I understood the question correctly, I don't think this is supported in TextIO out of the box.
Details
When you apply a transform like readAll(), there are a couple of steps between getting the initial file paths from the IO and emitting all the lines from all the files at the end.
For example, the logic in TextIO:
- it accepts a PCollection of file paths (or path patterns);
- it applies FileIO.matchAll(), which converts the PCollection of path patterns into a PCollection of MatchResult.Metadata objects that describe those paths;
- then it applies FileIO.readMatches(), which converts the metadata objects into ReadableFile objects that describe specific files;
- and lastly it applies TextIO.readFiles(), which takes in a ReadableFile and outputs all the strings from that file;
  - at this last step you would want to add the file path to the output, so that you know which string comes from which file. What would help is an option to change the last step to emit KV<ReadableFile, String> instead of just strings, so that you could access the file path using ReadableFile.getMetadata().
Looking around that code, it seems that emitting the raw lines from the files is the only output TextIO supports right now.
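The change wished for in that last step boils down to pairing every decoded line with the path of the file it came from. Below is a minimal, Beam-free sketch of that per-file logic in plain Java, so it can be run without any pipeline. The class and method names (PathTaggedLines, readWithPath) are hypothetical, Map.Entry stands in for Beam's KV, and gzip plus UTF-8 are assumed because the question's files end in .gz.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PathTaggedLines {

  /**
   * Reads a gzip-compressed UTF-8 text stream and pairs every line with the
   * given path. The path is only used as a label; the data comes from the stream.
   */
  public static List<Map.Entry<String, String>> readWithPath(String path, InputStream gzipped)
      throws IOException {
    List<Map.Entry<String, String>> pairs = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        pairs.add(new SimpleImmutableEntry<>(path, line));
      }
    }
    return pairs;
  }

  public static void main(String[] args) throws IOException {
    // Build an in-memory stand-in for topicfile1.gz from the question.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
      gz.write("topic1.line1\ntopic1.line2\n".getBytes(StandardCharsets.UTF_8));
    }

    String path = "gs://folder1/Topic1/topicfile1.gz";
    InputStream in = new ByteArrayInputStream(buffer.toByteArray());
    for (Map.Entry<String, String> kv : readWithPath(path, in)) {
      System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
  }
}
```

Inside a real pipeline the stream would come from the ReadableFile rather than from memory, and the pairs would be emitted as KV elements instead of being collected in a list.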
Workarounds
Probably the most straightforward way is to write your own PTransform similar to TextIO.ReadAll. This would work something like this:
High level:
- create and customize your own version of TextIO.ReadAll;
- and of ReadAllViaFileBasedSource;
- change your version of ReadAllViaFileBasedSource to emit what you want;
- use this custom version of TextIO.ReadAll, which uses your custom version of ReadAllViaFileBasedSource that emits the correct things;
Slightly more detailed:
- just copy the whole TextIO.ReadAll; it's a pretty short wrapper for FileIO that implements the steps I mentioned above;
- but in expand(), at the last step, instead of readFiles() you would apply custom logic that emits your desired KVs:
  - readFiles() right now is implemented by ReadAllViaFileBasedSource;
  - ReadAllViaFileBasedSource seems to be the actual thing that converts ReadableFiles into strings;
  - you would create a copy of ReadAllViaFileBasedSource and change the output logic, so that instead of emitting just the file's lines it also emits the metadata;
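A lighter variant of this workaround, offered as a sketch and not as the exact method described above: instead of copying ReadAllViaFileBasedSource, keep FileIO.matchAll() and FileIO.readMatches() and replace only the last step with a plain DoFn that reads each ReadableFile and emits KV<path, line>. The names ReadLinesWithPath and LinesWithPathFn are hypothetical. The sketch assumes UTF-8 text and assumes that readMatches() picks up gzip from the .gz extension; if in doubt, set the compression explicitly with withCompression(...).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

/** Hypothetical transform: file paths (or patterns) in, KV<path, line> out. */
public class ReadLinesWithPath
    extends PTransform<PCollection<String>, PCollection<KV<String, String>>> {

  @Override
  public PCollection<KV<String, String>> expand(PCollection<String> paths) {
    return paths
        // Path strings -> MatchResult.Metadata
        .apply("MatchFiles", FileIO.matchAll())
        // MatchResult.Metadata -> ReadableFile
        .apply("OpenFiles", FileIO.readMatches())
        // ReadableFile -> KV<path, line>; this replaces TextIO.readFiles()
        .apply("EmitPathAndLine", ParDo.of(new LinesWithPathFn()));
  }

  /** Reads one file line by line and tags each line with the file's path. */
  static class LinesWithPathFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {

    @ProcessElement
    public void processElement(
        @Element FileIO.ReadableFile file, OutputReceiver<KV<String, String>> out)
        throws IOException {
      // The path survives here because the ReadableFile carries its metadata.
      String path = file.getMetadata().resourceId().toString();

      // file.open() returns a channel over the file contents (assumed to be
      // decompressed according to the compression chosen in readMatches()).
      try (BufferedReader reader = new BufferedReader(
          Channels.newReader(file.open(), StandardCharsets.UTF_8.name()))) {
        String line;
        while ((line = reader.readLine()) != null) {
          out.output(KV.of(path, line));
        }
      }
    }
  }
}
```

Applied to the PCollection<String> of paths coming from Pub/Sub, paths.apply(new ReadLinesWithPath()) should produce pairs shaped like the ones in the question. Each file is read start to finish by a single DoFn call with no splitting, which costs little for gzip input since gzip files cannot be split anyway.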
Source: https://stackoverflow.com/questions/53876447/apache-beam-textio-readall-how-to-emit-keyvalue-instead-of-string-of-pcollection