Apache Beam TextIO.ReadAll: how to emit a KV instead of a String in a PCollection

Submitted by 荒凉一梦 on 2021-01-27 18:43:04

Question


The pipeline starts by reading from PubsubIO. Each Pub/Sub message is a GCS file path. I know that I can use readAll() to emit the lines from each path, but that doesn't serve my purpose because the information about the file path is lost. What I need is to emit a KV<'Filepath','Lines inside files'>.

The Pub/Sub messages will look like this:

Message1 -> gs://folder1/Topic1/topicfile1.gz
Message2 -> gs://folder1/Topic2/topicfile2.gz

Assume that the file contents are like below

topicfile1.gz
{
topic1.line1
topic1.line2
}

topicfile2.gz
{
topic2.line1
topic2.line2
}

What I am expecting is a PCollection like the one below

{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line1'>}
{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line2'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line1'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line2'>}

I couldn't find a way to read a file from its path inside a ParDo function so that I can map the path to its lines.

Hope this is clear.


Answer 1:


If I understood the question correctly, I don't think TextIO supports this out of the box.

Details

When you apply a transform like readAll(), several steps happen between receiving the initial file paths and emitting all the lines from all the files.

For example, the logic in TextIO:

  • it accepts a PCollection of file paths (or path patterns);
  • it applies FileIO.matchAll(), which converts the PCollection of path patterns into a PCollection of MatchResult.Metadata objects that describe those paths;
  • then it applies FileIO.readMatches(), which converts the metadata objects into ReadableFile objects that describe specific files;
  • and lastly it applies TextIO.readFiles(), which takes in a ReadableFile and outputs all the strings from that file;
    • at this last step you would want to add the file path to the output, so that you know which string comes from which file. It would help if there were an option to change the last step to emit KV<ReadableFile, String> instead of just strings, so that you could get the file path from ReadableFile.getMetadata().
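
To make the chain above concrete, here is a toy model in plain Java rather than Beam code. The class ReadAllModel, its in-memory FILES map and the prefix matching are all made up for illustration; they only mimic the shape of the matchAll(), readMatches() and readFiles() steps and show where the path drops out of the output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the readAll() step chain. Nothing here is a Beam API:
// the map is a hypothetical stand-in for a file system.
class ReadAllModel {
    // Hypothetical in-memory "file system": path -> lines of that file.
    static final Map<String, List<String>> FILES = new LinkedHashMap<>();
    static {
        FILES.put("gs://folder1/Topic1/topicfile1.gz",
                List.of("topic1.line1", "topic1.line2"));
        FILES.put("gs://folder1/Topic2/topicfile2.gz",
                List.of("topic2.line1", "topic2.line2"));
    }

    // Stands in for matchAll() + readMatches(): resolve a path prefix
    // to the concrete files it matches.
    static List<String> match(String prefix) {
        List<String> matched = new ArrayList<>();
        for (String path : FILES.keySet()) {
            if (path.startsWith(prefix)) {
                matched.add(path);
            }
        }
        return matched;
    }

    // Stands in for readFiles(): flatten every matched file into its lines.
    // The path is known inside the loop but is not part of the output.
    static List<String> readAll(List<String> prefixes) {
        List<String> lines = new ArrayList<>();
        for (String prefix : prefixes) {
            for (String path : match(prefix)) {
                lines.addAll(FILES.get(path));
            }
        }
        return lines;
    }
}
```

The result of readAll is a flat list of strings, so nothing downstream can tell which file a given line came from. That is the information loss described in the question.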

Looking around that code it seems that emitting the raw lines from the files is the only supported way of doing things using TextIO right now.

Workarounds

Probably the most straightforward way is to write your own PTransform similar to TextIO.ReadAll. This would work something like this:

High Level:

  • create and customize your own version of TextIO.ReadAll;
  • and of ReadAllViaFileBasedSource;
  • change your version of ReadAllViaFileBasedSource to emit what you want;
  • use this custom version of TextIO.ReadAll, which relies on your custom ReadAllViaFileBasedSource and therefore emits the KVs you need;

Slightly more detailed:

  • just copy the whole TextIO.ReadAll, it's a pretty short wrapper for FileIO that implements the steps I mentioned above;
  • but in the expand(), at the last step, instead of readFiles() you would apply a custom logic that will emit your desired KVs:
    • readFiles() right now is implemented by ReadAllViaFileBasedSource;
    • ReadAllViaFileBasedSource seems to be the actual thing that converts ReadableFiles into strings;
    • you would create a copy of ReadAllViaFileBasedSource and change the output logic, so that instead of emitting only the lines it also emits the file's metadata (or just its path) alongside each line;
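
The changed output logic boils down to reading one file line by line and attaching the path to every line. Below is a minimal sketch of that per-file step in plain Java, without the Beam SDK. The class LinesWithPath and its gzip helper are hypothetical names for this illustration, and the gzip handling is an assumption based on the .gz files in the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Beam: the per-file logic only.
class LinesWithPath {
    // Reads a gzip-compressed stream line by line and pairs every
    // line with the path it came from (key = path, value = line).
    static List<Map.Entry<String, String>> read(String path, InputStream gzipped) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.add(new SimpleImmutableEntry<>(path, line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // Helper for trying the sketch: gzip a string in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

In a Beam pipeline the same loop would presumably sit inside a DoFn that takes a FileIO.ReadableFile and outputs KV<String, String>. It would read from the channel returned by the file's open() method, key each line by getMetadata().resourceId().toString(), and be applied after FileIO.matchAll() and FileIO.readMatches() in place of TextIO.readFiles(). Treat that mapping as an untested assumption and check it against the Beam version you use.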


Source: https://stackoverflow.com/questions/53876447/apache-beam-textio-readall-how-to-emit-keyvalue-instead-of-string-of-pcollection
