Handling “”, “-” CSV with Univocity

Question

Any idea how I can get properly separated rows? Some lines are getting glued together, and I can't figure out why or how to stop it.

  col. 0: Date
  col. 1: Col2
  col. 2: Col3
  col. 3: Col4
  col. 4: Col5
  col. 5: Col6
  col. 6: Col7
  col. 7: Col7
  col. 8: Col8

  col. 0: 2017-05-23
  col. 1: String
  col. 2: lo rem ipsum
  col. 3: dolor sit amet
  col. 4: mcdonalds.com/online.html
  col. 5: null
  col. 6: "","-""-""2017-05-23"
  col. 7: String
  col. 8: lo rem ipsum
  col. 9: dolor sit amet
  col. 10: burgerking.com
  col. 11: https://burgerking.com/
  col. 12: 20
  col. 13: 2
  col. 14: fake

  col. 0: 2017-05-23
  col. 1: String
  col. 2: lo rem ipsum
  col. 3: dolor sit amet
  col. 4: wendys.com
  col. 5: null
  col. 6: "","-""-""2017-05-23"
  col. 7: String
  col. 8: lo rem ipsum
  col. 9: dolor sit amet
  col. 10: buggagump.com
  col. 11: null
  col. 12: "","-""-""2017-05-23"
  col. 13: String
  col. 14: cheese
  col. 15: ad eum
  col. 16: mcdonalds.com/online.html
  col. 17: null
  col. 18: "","-""-""2017-05-23"
  col. 19: String
  col. 20: burger
  col. 21: ludus dissentiet
  col. 22: www.mcdonalds.com
  col. 23: https://www.mcdonalds.com/
  col. 24: 25
  col. 25: 3
  col. 26: fake

  col. 0: 2017-05-23
  col. 1: String
  col. 2: wine
  col. 3: id erat utamur
  col. 4: bubbagump.com
  col. 5: https://buggagump.com/
  col. 6: 25
  col. 7: 3
  col. 8: fake
  done

A sample CSV (the \r\n may have gotten corrupted when copy/pasting). Available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0

"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8"
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake"
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-"
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake"
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"

Building settings:

  CsvParserSettings settings = new CsvParserSettings();

  settings.setDelimiterDetectionEnabled(true);
  settings.setQuoteDetectionEnabled(true);

  settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
  settings.getFormat().setLineSeparator("\r\n");

  CsvParser parser = new CsvParser(settings);

  List<String[]> rows;

  rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));

  for (String[] row : rows)
  {
    System.out.println("");
    int i = 0;

    for (String element : row)
    {
      System.out.println("col. " + i++ + ": " + element);
    }
  }

  System.out.println("done");

Answer 1:

As you are testing the auto-detection process, I suggest you print out the detected format with:

CsvFormat format = parser.getDetectedFormat();
System.out.println(format);

This will print out:

CsvFormat:
    Comment character=#
    Field delimiter=,
    Line separator (normalized)=\n
    Line separator sequence=\r\n
    Quote character="
    Quote escape character=-
    Quote escape escape character=null

As you can see, the parser is not detecting the quote escape correctly: it picked - instead of ". While the format detection process is typically very good, it is not guaranteed to always get it right, especially with small test samples. In your sample I can't see why it would pick up the - as the escape character, so I opened an issue to investigate what is making it detect that one.
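To see why a wrong quote escape can glue rows together, here is a minimal sketch of a toy record counter. It is not univocity's actual algorithm, and the class and method names (QuoteEscapeDemo, countRecords) are made up for illustration. It only assumes the usual rule that an escape character followed by a quote inside a quoted value means a literal quote.

```java
class QuoteEscapeDemo {

    // Counts records in csv. Inside a quoted value, `escape` followed by `quote`
    // is treated as a literal quote, and line breaks do not end the record.
    static int countRecords(String csv, char quote, char escape) {
        boolean inQuotes = false;
        boolean pending = false; // true once the current record has any content
        int records = 0;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                pending = true;
                boolean nextIsQuote = i + 1 < csv.length() && csv.charAt(i + 1) == quote;
                if (c == escape && nextIsQuote) {
                    i++; // escaped quote: consume both characters, stay inside the value
                } else if (c == quote) {
                    inQuotes = false; // closing quote
                }
            } else if (c == quote) {
                inQuotes = true; // opening quote
                pending = true;
            } else if (c == '\n') {
                if (pending) {
                    records++;
                }
                pending = false;
            } else if (c != '\r') {
                pending = true;
            }
        }
        return pending ? records + 1 : records;
    }

    public static void main(String[] args) {
        // Two rows; the first one ends with the quoted value "-"
        String csv = "\"a\",\"-\"\r\n\"b\",\"c\"\r\n";
        System.out.println(countRecords(csv, '"', '"')); // prints 2
        System.out.println(countRecords(csv, '"', '-')); // prints 1
    }
}
```

With - as the escape, the sequence -" at the end of the first row is read as an escaped quote, so the quoted value never closes before the line break and the next row is swallowed into it. That is the same kind of gluing seen in the output above.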

What you can do right now as a workaround, if you know for a fact that none of your input files will ever have - as the quote escape, is to detect the format, test what it picked up from the input, and then parse the contents, like this:

public List<String[]> parse(File input, CsvFormat format) {
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { //no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); //just call begin parsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); //capture the format
        parser.stopParsing(); //stop the parser - no need to read anything yet.

        System.out.println(format);

        if (format.getQuoteEscape() == '-') { //got something weird detected? Let's amend it.
            format.setQuoteEscape('"');
        }

        return parse(input, format); //now parse with the intended format
    } else {
        settings.setFormat(format); //this parses with the format adjusted earlier.
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }

}

Now just call the parse method:

List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);

And you will have your data properly extracted. Hope this helps!
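If all your files are known to share one format, another option is to skip detection and set the format by hand. The fragment below is a sketch under that assumption, using the values from the sample in the question (comma delimiter, " as both quote and quote escape, \r\n line endings). The class and method names are hypothetical, and it needs univocity-parsers on the classpath.

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.File;
import java.util.List;

class ExplicitFormatExample {

    // Parses the file with a fixed, hand-written format instead of auto-detection.
    // The values below are assumptions taken from the sample CSV in the question.
    static List<String[]> parseWithKnownFormat(File input) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter(',');
        settings.getFormat().setQuote('"');
        settings.getFormat().setQuoteEscape('"');
        settings.getFormat().setLineSeparator("\r\n");

        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }
}
```

This trades flexibility for predictability: nothing is guessed, but files with a different delimiter or quoting style will need their own settings.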

Source: https://stackoverflow.com/questions/44208137/handling-csv-with-univocity

Tags

java

csv

parsing

univocity