How to skip carriage returns in a CSV file while reading from Cloud Storage using Google Cloud Dataflow in Java

耶瑟儿~ 2021-01-29 10:37

I have a CSV file whose rows contain embedded newline characters (\n). When I read the file from Cloud Storage using Apache Beam's TextIO.read, it is conside…

1 answer
  •  轻奢々 (original poster)
    2021-01-29 11:19

    TextIO cannot do this: it always splits its input on line breaks and is not aware that CSV quoting makes some of those line breaks part of a field.

    However, Beam 2.2 and later include FileIO, a transform that makes it easy to write the CSV-specific (or any other format-specific) reading code yourself. Do something like this:

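    // Needs: org.apache.beam.sdk.io.FileIO, org.apache.beam.sdk.transforms.DoFn,
    // org.apache.beam.sdk.transforms.ParDo, java.io.InputStream, java.io.IOException,
    // java.nio.channels.Channels
    p.apply(FileIO.match().filepattern("gs://..."))
     .apply(FileIO.readMatches())
     // Output type String is an assumption; use whatever type your CSV records map to.
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
       @ProcessElement
       public void process(ProcessContext c) throws IOException {
         // open() returns a ReadableByteChannel over the whole matched file
         try (InputStream is = Channels.newInputStream(c.element().open())) {
           // ... Use your favorite Java CSV library to parse `is` ...
           // ... c.output(next CSV record) for each parsed record ...
         }
       }
     }));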
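
To make the "parse it yourself" step concrete, here is a minimal sketch of the record-splitting logic using only the JDK. The class `QuotedCsvRecords` and its method `splitRecords` are hypothetical names introduced for illustration, not part of Beam, and the sketch only handles standard double-quote quoting. A real CSV library is likely the better choice in practice; this just shows how quoted line breaks can be kept inside a record.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Beam API): splits CSV text into logical records. */
public class QuotedCsvRecords {

  /**
   * Returns one string per CSV record. Line breaks inside double-quoted
   * fields stay in the record; line breaks outside quotes end the record.
   */
  public static List<String> splitRecords(Reader reader) {
    List<String> records = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    boolean lastWasCr = false;
    try {
      int ch;
      while ((ch = reader.read()) != -1) {
        char c = (char) ch;
        if (c == '"') {
          // An escaped quote ("") toggles twice, so the state is preserved.
          inQuotes = !inQuotes;
          current.append(c);
          lastWasCr = false;
        } else if (!inQuotes && (c == '\n' || c == '\r')) {
          // Treat \r\n as a single terminator: skip the \n that follows a \r.
          if (!(c == '\n' && lastWasCr)) {
            records.add(current.toString());
            current.setLength(0);
          }
          lastWasCr = (c == '\r');
        } else {
          current.append(c);
          lastWasCr = false;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Emit a final record that has no trailing line break.
    if (current.length() > 0) {
      records.add(current.toString());
    }
    return records;
  }

  public static void main(String[] args) {
    String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
    for (String record : splitRecords(new StringReader(csv))) {
      System.out.println("[" + record + "]");
    }
  }
}
```

Inside the DoFn above, one could wrap the stream as `new InputStreamReader(is, StandardCharsets.UTF_8)` (the encoding is an assumption), pass it to `splitRecords`, and call `c.output(record)` for each result. Note that with this approach each matched file is read as a whole by a single `process` call rather than being split into ranges.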
    
