I have a CSV file that contains newline characters (\n) within each row. While reading the CSV file from cloud storage using the TextIO.read function of Apache Beam, it is conside
TextIO cannot do this: it always splits input on line breaks and is not aware that CSV quoting makes some of those line breaks part of a field.
However, Beam 2.2 includes a transform, FileIO, that makes it easy to write the CSV-specific (or any other file-format-specific) reading code yourself. Do something like this:
p.apply(FileIO.match().filepattern("gs://..."))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new DoFn<FileIO.ReadableFile, TableRow>() {
   @ProcessElement
   public void process(ProcessContext c) throws IOException {
     // open() returns a ReadableByteChannel; wrap it to get an InputStream.
     try (InputStream is = Channels.newInputStream(c.element().open())) {
       // Parse `is` with your favorite Java CSV library (one that understands
       // quoted newlines) and call c.output(...) once per parsed CSV record.
     }
   }
 }));
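
To illustrate what the CSV-parsing step has to do differently from a plain line split, here is a minimal sketch of a quote-aware record reader that uses only the JDK. The class name QuotedCsvSketch and its methods are hypothetical names made up for this example, and the parser handles only the basics (comma separator, double-quoted fields, "" as an escaped quote). In a real pipeline a maintained CSV library is the better choice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Beam or any library.
class QuotedCsvSketch {

    /** Reads every record; a newline inside double quotes stays in the field. */
    static List<List<String>> parse(Reader in) throws IOException {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;   // currently inside a quoted section
        boolean afterQuote = false; // previous char closed a quoted section
        boolean started = false;    // current record has some content
        int ch;
        while ((ch = in.read()) != -1) {
            char c = (char) ch;
            if (inQuotes) {
                if (c == '"') {
                    inQuotes = false;
                    afterQuote = true;
                } else {
                    field.append(c); // includes '\n' and ',' inside quotes
                }
                continue;
            }
            if (c == '"') {
                if (afterQuote) {
                    field.append('"'); // "" inside quotes is an escaped quote
                }
                inQuotes = true;
                afterQuote = false;
                started = true;
                continue;
            }
            afterQuote = false;
            if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
                started = true;
            } else if (c == '\n') {
                if (started) { // skip completely empty lines
                    record.add(field.toString());
                    records.add(record);
                    record = new ArrayList<>();
                }
                field.setLength(0);
                started = false;
            } else if (c != '\r') { // drop '\r' outside quotes (CRLF endings)
                field.append(c);
                started = true;
            }
        }
        if (started) { // last record without a trailing newline
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    /** Convenience wrapper for in-memory text. */
    static List<List<String>> parseString(String text) {
        try {
            return parse(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String csv = "1,\"line one\nline two\",x\n2,plain,y\n";
        // A line-based split sees three lines here; this reader sees two records.
        System.out.println(parseString(csv).size());
    }
}
```

Inside the DoFn, one possible way to use such a reader would be to wrap the stream, for example new InputStreamReader(is, StandardCharsets.UTF_8), pass it to parse, and call c.output once per returned record after converting it to the output type. The UTF-8 charset is an assumption about the file's encoding.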