How to skip carriage returns in a CSV file while reading from Cloud Storage using Google Cloud Dataflow in Java

Submitted by 大憨熊 on 2019-12-02 22:54:34

Question


I have a CSV file in which rows contain embedded line breaks (\n). When I read the file from Cloud Storage using Apache Beam's TextIO.read, every \n is treated as the start of a new record. How can I overcome this issue?

I tried extending FileBasedSource, but it reads only the first line of the CSV file when we apply PTransforms.

Any help would be appreciated.

Thanks in advance.


Answer 1:


TextIO cannot do this: it always splits its input on line breaks and is not aware of the CSV-specific quoting that protects some of those line breaks.
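To make the problem concrete, here is a small illustration in plain Java. It is not Beam code: the class name LineSplitDemo and the sample data are made up for this example, and BufferedReader merely stands in for any reader that splits on line breaks without looking at CSV quoting.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical demo class: shows what line-based splitting does to a quoted CSV field. */
final class LineSplitDemo {

  /** Splits text on line breaks only, the way a line-oriented reader does. */
  static List<String> splitLines(String text) {
    return new BufferedReader(new StringReader(text))
        .lines()
        .collect(Collectors.toList());
  }
}
```

Given a header row plus one data row whose quoted field contains a line break, such as "id,comment\n1,\"first line\nsecond line\"\n", splitLines returns three strings instead of two records, and the quoted field is cut in half.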

However, Beam 2.2 includes a transform, FileIO, that makes it easy to write the CSV-specific (or any other file-format-specific) reading code yourself. Do something like this:

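import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
// Assumption: TableRow is the BigQuery model class.
import com.google.api.services.bigquery.model.TableRow;

p.apply(FileIO.match().filepattern("gs://..."))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new DoFn<FileIO.ReadableFile, TableRow>() {
   @ProcessElement
   public void process(ProcessContext c) throws IOException {
     // open() returns a ReadableByteChannel over the matched file.
     try (InputStream is = Channels.newInputStream(c.element().open())) {
       // Parse `is` with your favorite Java CSV library, then call
       // c.output(...) once for each CSV record it returns.
     }
   }
 }));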
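
For the parsing step left open in the snippet above, an established CSV library is the better choice in practice. Purely as a sketch of what "quote-aware" reading means, the following plain-Java helper splits CSV text into records while keeping line breaks that sit inside double-quoted fields. The class QuotedCsvReader and its methods are hypothetical names, not part of Beam, and the code assumes comma separators and RFC 4180-style quoting, where "" inside a quoted field stands for a literal quote.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads CSV records, keeping line breaks inside quoted fields. */
final class QuotedCsvReader {

  /** Reads every record from the reader; each record is a list of field values. */
  static List<List<String>> readRecords(Reader source) throws IOException {
    PushbackReader in = new PushbackReader(source, 1);
    List<List<String>> records = new ArrayList<>();
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    boolean recordStarted = false; // false while only blank lines have been seen
    int ch;
    while ((ch = in.read()) != -1) {
      char c = (char) ch;
      if (inQuotes) {
        if (c == '"') {
          int next = in.read();
          if (next == '"') {
            field.append('"'); // "" inside quotes is an escaped quote
          } else {
            inQuotes = false; // closing quote; re-read the following character
            if (next != -1) {
              in.unread(next);
            }
          }
        } else {
          field.append(c); // line breaks inside quotes stay in the field
        }
      } else if (c == '"') {
        inQuotes = true;
        recordStarted = true;
      } else if (c == ',') {
        fields.add(field.toString());
        field.setLength(0);
        recordStarted = true;
      } else if (c == '\n' || c == '\r') {
        if (c == '\r') { // treat \r\n as a single line break
          int next = in.read();
          if (next != '\n' && next != -1) {
            in.unread(next);
          }
        }
        if (recordStarted) { // an unquoted line break ends the record
          fields.add(field.toString());
          records.add(fields);
          fields = new ArrayList<>();
          field.setLength(0);
          recordStarted = false;
        }
      } else {
        field.append(c);
        recordStarted = true;
      }
    }
    if (recordStarted) { // last record without a trailing line break
      fields.add(field.toString());
      records.add(fields);
    }
    return records;
  }

  /** Convenience wrapper for in-memory text. */
  static List<List<String>> parse(String text) {
    try {
      return readRecords(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Inside the DoFn, one could wrap the input stream in an InputStreamReader with the file's character set, pass it to readRecords, and call c.output once per returned record. The sketch collects every record of a file in memory before returning, so a streaming CSV library is preferable for large files.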


Source: https://stackoverflow.com/questions/47668101/how-to-skip-carriage-returns-in-csv-file-while-reading-from-cloud-storage-using
