Regular expression to split on commas that are outside the double quotes in a CSV file?

情话喂你 2021-01-26 10:00

This is the sample input:

"abc","abcsds","adbc,ds","abc"

Output should be

abc
abcsds
adbc,ds
abc

4 answers
  • 2021-01-26 10:21

    If you can be sure there are no inner, escaped quotes, then I guess it's ok to use a regular expression for this. However, most modern languages already have proper CSV parsers.

    Using a proper parser is the correct answer to this: Text::CSV for Perl, for example.
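
    As an illustration of the parser route in another language, here is a minimal sketch using Python's standard `csv` module. The helper name `parse_csv_line` is made up for this example; the default dialect is assumed, which treats `"` as the quote character and a doubled `""` as an escaped quote.

```python
import csv
import io


def parse_csv_line(line):
    """Parse a single CSV line with the standard library parser.

    parse_csv_line is a hypothetical helper name used only for this sketch.
    """
    # csv.reader expects an iterable of lines, so wrap the string in StringIO.
    return next(csv.reader(io.StringIO(line)))


# The sample from the question: the comma inside "adbc,ds" is preserved.
print(parse_csv_line('"abc","abcsds","adbc,ds","abc"'))
# ['abc', 'abcsds', 'adbc,ds', 'abc']
```

    The parser also copes with doubled quotes inside a field, which is where hand-written regular expressions tend to break.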

    However, if you're dead set on using regular expressions, I'd suggest you "borrow" from some sort of module, like this one: http://metacpan.org/pod/Regexp::Common::balanced

  • 2021-01-26 10:22

    This is a tougher job than you realize: not only can there be commas inside the quotes, but there can also be quotes inside the quotes. Two consecutive quotes inside a quoted string do not signal the end of the string. Instead, they signal a quote embedded in the string, so for example:

    "x", "y,""z"""
    

    should be parsed as:

    x
    y,"z"
    

    So, the basic sequence is something like this:

    Find the first non-white-space character.
    If it was a quote, read up to the next quote. Then read the next character.
        Repeat until that next character is not also a quote.
        If the next (non-whitespace) character is not a comma, input is malformed.
    If it was not a quote, read up to the next comma.
    Skip the comma, repeat the whole process for the next field.
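
    The sequence above can be sketched as a small hand-written scanner. This is one possible reading of the steps, not a complete CSV implementation: the function name `split_csv_line` is hypothetical, only spaces and tabs count as white space, fields spanning several lines are not handled, and trailing white space in unquoted fields is kept.

```python
def split_csv_line(line):
    """Split one CSV line by following the steps listed above (a sketch)."""
    fields = []
    i, n = 0, len(line)
    while True:
        # Find the first non-white-space character of the field.
        while i < n and line[i] in " \t":
            i += 1
        if i < n and line[i] == '"':
            # Quoted field: read up to the closing quote.
            i += 1
            chars = []
            while True:
                if i >= n:
                    raise ValueError("malformed input: unterminated quoted field")
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        # Two consecutive quotes are one embedded quote.
                        chars.append('"')
                        i += 2
                        continue
                    i += 1  # this was the closing quote
                    break
                chars.append(line[i])
                i += 1
            # The next non-white-space character must be a comma (or the end).
            while i < n and line[i] in " \t":
                i += 1
            if i < n and line[i] != ",":
                raise ValueError("malformed input: expected a comma")
            fields.append("".join(chars))
        else:
            # Unquoted field: read up to the next comma.
            start = i
            while i < n and line[i] != ",":
                i += 1
            fields.append(line[start:i])
        if i >= n:
            return fields
        i += 1  # skip the comma and repeat for the next field


print(split_csv_line('"x", "y,""z"""'))
# ['x', 'y,"z"']
```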
    

    Note that despite the tag, I'm not providing a regex -- I'm not at all sure I've seen a regex that can really handle this properly.

  • 2021-01-26 10:32

    This answer has a C# solution for dealing with CSV.

    In particular, the line

    private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
    

    contains the Regex used to split properly, i.e., taking quoting and escaping into consideration.

    Basically what it says is, match any comma that is followed by an even number of quote marks (including zero). This effectively prevents matching a comma that is part of a quoted string, since the quote character is escaped by doubling it.

    Keep in mind that the quotes in the above line are doubled for the sake of the string literal. It might be easier to think of the expression as

    ,(?=(?:[^"]*"[^"]*")*(?![^"]*"))
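
    To see the expression at work outside C#, here is a small sketch in Python with the same pattern. The names `CSV_SPLITTER`, `split_fields` and `unquote` are made up for this example. Note that the split leaves the surrounding quotes on each field, so a separate step is assumed for stripping them and collapsing doubled quotes.

```python
import re

# Same pattern as above: a comma followed by an even number of quote marks.
CSV_SPLITTER = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')


def split_fields(line):
    """Split on commas that are outside double quotes; quotes are kept."""
    return CSV_SPLITTER.split(line)


def unquote(field):
    """Strip surrounding quotes and collapse doubled quotes (hypothetical helper)."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


raw = split_fields('"abc","abcsds","adbc,ds","abc"')
print(raw)
# ['"abc"', '"abcsds"', '"adbc,ds"', '"abc"']
print([unquote(f) for f in raw])
# ['abc', 'abcsds', 'adbc,ds', 'abc']
```

    Because the lookahead rescans the rest of the line for every comma, this approach is likely to slow down on very long lines.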
    
  • 2021-01-26 10:34

    Try this:

    "(.*?)"
    

    If you need to put this regex inside a string literal, don't forget to escape the quotes:

    Regex re = new Regex("\"(.*?)\"");
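
    For comparison, the same pattern can be tried in Python. This sketch (the helper name `extract_quoted` is hypothetical) extracts the captured groups instead of splitting. It works on the sample from the question, but it assumes every field is quoted and that no field contains a doubled quote.

```python
import re

# Lazy match of everything between one quote and the next quote.
QUOTED_FIELD = re.compile(r'"(.*?)"')


def extract_quoted(line):
    """Return the contents of each double-quoted run in the line."""
    return QUOTED_FIELD.findall(line)


print(extract_quoted('"abc","abcsds","adbc,ds","abc"'))
# ['abc', 'abcsds', 'adbc,ds', 'abc']

# A doubled quote inside a field breaks the assumption:
print(extract_quoted('"x","y,""z"""'))
# ['x', 'y,', 'z', '']
```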
    