Java CSV parser with unescaped quotes [closed]

Posted by 随声附和 on 2019-11-29 22:35:09

Question


I have a CSV file that has some quoting issues:

"Albanese Confectionery","157137","ALBANESE BULK ASST. MINI WILD FRUIT WORMS 2" 4/5LB",9,90,0,0,0,.53,"21",50137,"3441851137","5 lb",1,4,4,$6.7,$6.7,$26.8

SuperCSV is choking on these fruit worms (pun intended). I know that the 2" should probably be 2"", but it's not. LibreOffice actually parses this correctly (which surprises me). I was thinking of just writing my own little parser but other rows have commas inside the string:

"Albanese Confectionery","157230","ALBANESE BULK JET FIGHTERS,ASSORTED 4/5  B",9,90,0,0,0,.53,"21",50230,"3441851230","5 lb",1,4,4,$6.7,$6.7,$26.8

Does anyone know of a Java library that will handle crazy stuff like this? Or should I try all the available ones? Or am I better off hacking this out myself?


Answer 1:


The right solution is to find the person who generated the data and beat them over the head with a keyboard until they fix the problem on their end.

Once you've exhausted that route, you could try some of the other CSV parsers on the market; I've used OpenCSV with success in the past.

Even if OpenCSV won't solve the problem out of the box, the code is fairly easy to read and available under an Apache license, so it might be possible to modify the algorithm to work with your wonky data, and probably easier than starting from scratch.




Answer 2:


Surprising even myself here, but I think I would hack it myself. I mean, you only need to read the lines and generate the tokens by splitting on quotes/commas, whichever you want. That way you can adjust the logic the way it suits you. It's not very hard. The file seems broken enough that going through existing solutions looks like more work.

One point though: if LibreOffice already parses it correctly, couldn't you just save the file from there, thus generating a file that is more reasonable? However, if you think LibreOffice might be guessing, just write the tokenizer yourself.
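
For what it's worth, a hand-rolled tokenizer along those lines could look roughly like the sketch below. The class and method names (LenientCsvSplitter, split) are made up for illustration, and the rule it applies is an assumption, not a standard: inside a quoted field, a quote only closes the field when it is directly followed by the delimiter or the end of the line, and any other quote is kept as literal text. A stray quote that happens to sit right before a comma would still fool it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class LenientCsvSplitter {

    /**
     * Splits one CSV line into fields. Assumption: a quote closes a quoted
     * field only when followed by the delimiter or the end of the line.
     */
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean last = i + 1 == line.length();

            if (inQuotes) {
                if (c == quote && (last || line.charAt(i + 1) == delimiter)) {
                    inQuotes = false;      // genuine closing quote
                } else if (c == quote && line.charAt(i + 1) == quote) {
                    field.append(quote);   // properly escaped "" becomes "
                    i++;                   // skip the second quote of the pair
                } else {
                    field.append(c);       // ordinary text, commas, stray quotes
                }
            } else if (c == quote && field.length() == 0) {
                inQuotes = true;           // opening quote at the start of a field
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // the last field has no trailing delimiter
        return fields;
    }
}
```

Calling LenientCsvSplitter.split(line, ',', '"') on the fruit worms row from the question should give 19 fields, with the third one keeping its inch mark: ALBANESE BULK ASST. MINI WILD FRUIT WORMS 2" 4/5LB. The row with the comma inside the quoted string stays in one piece too, since commas are only treated as separators outside quotes.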




Answer 3:


+1 for the 'choking on fruit worms' pun - I nearly choked on my coffee reading that :)

If you really can't get that CSV fixed, then you could just supply your own Tokenizer (Super CSV is very flexible like that!).

You'd normally write your own readColumns() implementation, but it's quicker to extend the default Tokenizer and override the readLine() method to intercept the String (and fix the unescaped quotes) before it's tokenized.

I've made an assumption here that any quotes not next to a delimiter or at the start/end of the line should be escaped. It's far from perfect, but it works for your sample input. You can implement this however you like - it was too early in the morning for me to use a regex :)

This way you don't have to modify Super CSV at all (it just plugs in), so you get all of the other features like cell processors and bean mapping as well.

package org.supercsv;
import java.io.IOException;
import java.io.Reader;
import org.supercsv.io.Tokenizer;
import org.supercsv.prefs.CsvPreference;

public class FruitWormTokenizer extends Tokenizer {

  public FruitWormTokenizer(Reader reader, CsvPreference preferences) {
    super(reader, preferences);
  }

  @Override
  protected String readLine() throws IOException {
    final String line = super.readLine();
    if (line == null) {
      return null;
    }

    final char quote = (char) getPreferences().getQuoteChar();
    final char delimiter = (char) getPreferences().getDelimiterChar();

    // escape all quotes not next to a delimiter (or start/end of line)
    final StringBuilder b = new StringBuilder(line);
    for (int i = b.length() - 1; i >= 0; i--) {
      if (quote == b.charAt(i)) {
        final boolean validCharBefore = i - 1 < 0
            || b.charAt(i - 1) == delimiter;
        final boolean validCharAfter = i + 1 == b.length()
            || b.charAt(i + 1) == delimiter;
        if (!(validCharBefore || validCharAfter)) {
          // escape that quote!
          b.insert(i, quote);
        }
      }
    }
    return b.toString();
  }
}

You can then supply this Tokenizer (instead of a plain Reader) to the constructor of your Super CSV reader, e.g. CsvListReader or CsvBeanReader.
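
For anyone who does prefer a regex, the same rule (double every quote that is not next to a delimiter and not at the start/end of the line) can be sketched with lookarounds. This is a standalone illustration using only java.util.regex; QuoteEscaper and escapeStrayQuotes are hypothetical names, not part of Super CSV. Like the loop above, it assumes the input contains no correctly escaped quotes, since an existing "" inside a field would be doubled again.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Super CSV.
final class QuoteEscaper {

    /** Doubles every quote that has a non-delimiter character on both sides. */
    static String escapeStrayQuotes(String line, char quote, char delimiter) {
        String q = Pattern.quote(String.valueOf(quote));
        String d = Pattern.quote(String.valueOf(delimiter));

        // (?<!^) and (?!$): the quote is not at the start or end of the line
        // (?<!d) and (?!d): the quote is not directly after or before a delimiter
        Pattern stray = Pattern.compile(
                "(?<!^)(?<!" + d + ")" + q + "(?!" + d + ")(?!$)");

        // quoteReplacement guards against quote chars that are special in replacements
        String doubled = Matcher.quoteReplacement("" + quote + quote);
        return stray.matcher(line).replaceAll(doubled);
    }
}
```

With '"' as the quote and ',' as the delimiter, a line such as "a","b 2" x",3 comes back as "a","b 2"" x",3, which a strict parser can then read. The body of readLine() could return the result of such a call instead of running the StringBuilder loop.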



Source: https://stackoverflow.com/questions/15210568/java-csv-parser-with-unescaped-quotes
