Question
Any idea how I can get proper lines? Some lines are getting glued together, and I can't figure out why or how to stop it. This is the output I get:
col. 0: Date
col. 1: Col2
col. 2: Col3
col. 3: Col4
col. 4: Col5
col. 5: Col6
col. 6: Col7
col. 7: Col7
col. 8: Col8
col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: mcdonalds.com/online.html
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: burgerking.com
col. 11: https://burgerking.com/
col. 12: 20
col. 13: 2
col. 14: fake
col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: wendys.com
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: buggagump.com
col. 11: null
col. 12: "","-""-""2017-05-23"
col. 13: String
col. 14: cheese
col. 15: ad eum
col. 16: mcdonalds.com/online.html
col. 17: null
col. 18: "","-""-""2017-05-23"
col. 19: String
col. 20: burger
col. 21: ludus dissentiet
col. 22: www.mcdonalds.com
col. 23: https://www.mcdonalds.com/
col. 24: 25
col. 25: 3
col. 26: fake
col. 0: 2017-05-23
col. 1: String
col. 2: wine
col. 3: id erat utamur
col. 4: bubbagump.com
col. 5: https://buggagump.com/
col. 6: 25
col. 7: 3
col. 8: fake
done
A sample CSV (the \r\n line endings may have been corrupted when copying and pasting) is available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0
"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8"
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake"
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-"
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake"
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"
Building settings:
CsvParserSettings settings = new CsvParserSettings();
settings.setDelimiterDetectionEnabled(true);
settings.setQuoteDetectionEnabled(true);
settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
settings.getFormat().setLineSeparator("\r\n");
CsvParser parser = new CsvParser(settings);
List<String[]> rows;
rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));
for (String[] row : rows)
{
    System.out.println("");
    int i = 0;
    for (String element : row)
    {
        System.out.println("col. " + i++ + ": " + element);
    }
}
System.out.println("done");
Answer 1:
As you are testing the auto-detection process, I suggest printing out the detected format with:
CsvFormat format = parser.getDetectedFormat();
System.out.println(format);
This will print out:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=-
Quote escape escape character=null
As you can see, the parser is not detecting the quote escape correctly. While the format detection process is typically very good, it is not guaranteed to always get it right, especially with small test samples. In your sample I can't see why it would pick up the - as the escape character, so I opened this issue to investigate and see what is making it detect that one.
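To see why a wrong quote escape glues lines together, here is a minimal sketch of a quote-aware record counter. It is a simplified illustration, not univocity's actual algorithm; the class and method names (QuoteEscapeDemo, countRecords) are made up for this example. When - is treated as the escape, the closing quote of a "-" field is read as an escaped quote, so the field never closes and the line break that follows ends up inside it.

```java
class QuoteEscapeDemo {
    // Counts records by counting line breaks that occur OUTSIDE quoted fields.
    // Simplified on purpose: it only tracks quote state, not field values.
    static int countRecords(String csv, char quote, char escape) {
        int records = 0;
        boolean inQuotes = false;
        boolean pending = false; // characters seen since the last record break
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                pending = true;
                boolean nextIsQuote = i + 1 < csv.length() && csv.charAt(i + 1) == quote;
                if (c == escape && nextIsQuote) {
                    i++; // escaped quote: consume it and stay inside the field
                } else if (c == quote) {
                    inQuotes = false; // closing quote
                }
            } else if (c == quote) {
                inQuotes = true;
                pending = true;
            } else if (c == '\n') {
                records++;
                pending = false;
            } else if (c != '\r') {
                pending = true;
            }
        }
        return pending ? records + 1 : records;
    }

    public static void main(String[] args) {
        // Two records, the first one ending in a "-" field, as in the sample file.
        String csv = "\"wendys.com\",\"-\"\r\n\"wine\",\"fake\"\r\n";
        System.out.println(countRecords(csv, '"', '"')); // prints 2
        System.out.println(countRecords(csv, '"', '-')); // prints 1
    }
}
```

With the usual " escape the two lines stay separate; with - as the escape they collapse into a single record, which matches the symptom in the question.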
What you can do right now as a workaround, if you know for a fact that none of your input files will ever have - as the quote escape, is to detect the format, test what it picked up from the input, and then parse the contents, like this:
public List<String[]> parse(File input, CsvFormat format) {
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { //no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();
        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); //just call begin parsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); //capture the format
        parser.stopParsing(); //stop the parser - no need to read anything yet.
        System.out.println(format);
        if (format.getQuoteEscape() == '-') { //got something weird detected? Let's amend it.
            format.setQuoteEscape('"');
        }
        return parse(input, format); //now parse with the intended format
    } else {
        settings.setFormat(format); //this parses with the format adjusted earlier.
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }
}
Now just call the parse method:
List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);
And you will have your data properly extracted. Hope this helps!
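If you want to confirm that no rows are still glued after parsing, a quick sanity check is to compare each row's column count against the header. This is only a sketch that assumes the first row is the header and that every record should have the same number of columns; the names RowWidthCheck and findSuspectRows are hypothetical and not part of univocity.

```java
import java.util.ArrayList;
import java.util.List;

class RowWidthCheck {
    // Returns the indices of rows whose column count differs from the header row (index 0).
    static List<Integer> findSuspectRows(List<String[]> rows) {
        List<Integer> suspects = new ArrayList<>();
        if (rows.isEmpty()) {
            return suspects;
        }
        int expected = rows.get(0).length;
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                suspects.add(i);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Column counts taken from the output in the question: 9, 15, 27, 9.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[9]);  // header
        rows.add(new String[15]); // two records glued together
        rows.add(new String[27]); // four records glued together
        rows.add(new String[9]);  // a correctly parsed record
        System.out.println(findSuspectRows(rows)); // prints [1, 2]
    }
}
```

An empty result for the list returned by parse(...) suggests every record was split where it should be.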
Source: https://stackoverflow.com/questions/44208137/handling-csv-with-univocity