Regular expression to match CSV delimiters

天命终不由人 2020-12-17 22:39

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:

1,"abcd",


        
6 Answers
  • 2020-12-17 22:49

    As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).

    • I suggest the tool CSVFIX as likely to do what you need.

    To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):

    """",,"",a,"a,b"
    

    Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:

    "",,"",a",b c",
    

    The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!

    This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:

    use strict;
    use warnings;
    
    my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );
    
    foreach my $string (@list)
    {
        print "Pattern:  <<$string>>\n";
        while ($string =~ m/ (?: " ( (?:""|[^"])* ) "  |  ( [^,"] [^,]* )  |  ( .? ) )
                             (?: $ | , ) /gx)
        {
            print "Found QF: <<$1>>\n" if defined $1;
            print "Found PF: <<$2>>\n" if defined $2;
            print "Found EF: <<$3>>\n" if defined $3;
        }
    }
    

    Note that as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out the enclosing double quotes and collapse the nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!

    Output:

    Pattern:  <<"""",,"",a,"a,b">>
    Found QF: <<"">>
    Found EF: <<>>
    Found QF: <<>>
    Found PF: <<a>>
    Found QF: <<a,b>>
    Found EF: <<>>
    Pattern:  <<"",,"",a",b c",>>
    Found QF: <<>>
    Found EF: <<>>
    Found QF: <<>>
    Found PF: <<a">>
    Found PF: <<b c">>
    Found EF: <<>>
    

    We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct. Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
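
    The two-stage idea described above can be sketched in Python. This is a loose port under stated assumptions, not the answer's own code: the function name split_line is made up for illustration, and a fourth capture group is added to record whether a field ended at a comma or at end of line, which is one way to avoid the debatable trailing empty field on the first sample line.

```python
import re

# Same three alternatives as the Perl pattern, plus a fourth group
# (an addition, not in the original) recording how the field ended.
FIELD = re.compile(r'''
    (?: "((?:""|[^"])*)"    # group 1: quoted field, doubled quote = escape
      | ([^,"][^,]*)        # group 2: plain field, quotes are ordinary text
      | (.?)                # group 3: empty field
    )
    (,|$)                   # group 4: comma, or empty string at end of line
''', re.VERBOSE)


def split_line(line):
    """Split one CSV line (no embedded newlines) into a list of fields."""
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        if m is None:
            raise ValueError(f"cannot parse field at offset {pos}: {line!r}")
        if m.group(1) is not None:
            # Second pass: collapse each doubled quote into a single one.
            fields.append(m.group(1).replace('""', '"'))
        elif m.group(2) is not None:
            fields.append(m.group(2))
        else:
            fields.append(m.group(3))
        if m.group(4) == "":
            # The field ended at end of line, so there is no further field.
            return fields
        pos = m.end()


print(split_line('"""",,"",a,"a,b"'))   # ['"', '', '', 'a', 'a,b']
print(split_line('"",,"",a",b c",'))    # ['', '', '', 'a"', 'b c"', '']
```

    With the terminator recorded, the first line yields five fields and the second line still ends with an empty field after its trailing comma. A quoted field with no closing quote raises ValueError rather than guessing.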

  • 2020-12-17 22:56

    I know this is old, but this RegEx works for me:

    /(\"[^\"]+\")|[^,]+/g
    

    It could potentially be used with any language. I tested it in JavaScript, so the g is just the global modifier. It works even with messed-up lines (extra quotes), but empty fields are not handled.

    Just sharing, maybe this will help someone.
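
    To make the empty-field caveat concrete, here is a small Python sketch of the same pattern (Python rather than JavaScript only so that it is easy to run; the helper name match_fields is hypothetical). Because both alternatives require at least one character, an empty field between two commas produces no match at all, so field positions shift.

```python
import re

# The pattern from this answer, without the JavaScript /.../g wrapper.
FIELD = re.compile(r'("[^"]+")|[^,]+')


def match_fields(line):
    # group(0) is the whole match, whichever alternative fired.
    return [m.group(0) for m in FIELD.finditer(line)]


print(match_fields('1,"a,b",x'))    # ['1', '"a,b"', 'x']
print(match_fields('1,"a,b",,x'))   # ['1', '"a,b"', 'x']  (empty field lost)
```

    Note that the quoted field keeps its surrounding quotes, and the second line returns three items even though it has four fields.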

  • 2020-12-17 23:04

    Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.

    Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:

    use Text::CSV;
    use feature 'say';
    
    my $csv = Text::CSV->new ( { binary => 1, eol => $/ } )
        or die "Cannot use CSV: ".Text::CSV->error_diag ();
    my $rows = $csv->getline_all(\*STDIN);   # pass a filehandle reference, not a bareword
    for my $row (@$rows) {
        say join("\t", @$row);
    }
    

    That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?
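
    As an example of "the equivalent in your preferred scripting language", here is a minimal sketch using Python's standard csv module. The function names parse_csv and to_tsv are made up for illustration. Writing the output through csv.writer rather than a plain join is one way to sidestep the embedded-tab concern, since the writer quotes any field that contains the delimiter.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a list of rows, each a list of strings."""
    # newline='' lets the csv module see the line endings itself, which
    # it needs in order to handle newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline='')))


def to_tsv(rows):
    """Serialise rows as tab-separated text, quoting fields when needed."""
    out = io.StringIO()
    csv.writer(out, delimiter='\t', lineterminator='\n').writerows(rows)
    return out.getvalue()


rows = parse_csv('"""",,"",a,"a,b"\n')
print(rows)   # [['"', '', '', 'a', 'a,b']]
```

    Keeping the parsed rows as lists, and only serialising at the very end, is the "proper lists" approach the paragraph above hints at.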

  • 2020-12-17 23:07

    See my post that solves this problem for more detail.

    ^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$

    This will match the whole line; you can then use match.Groups[1].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.

  • 2020-12-17 23:11

    Without thinking too hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?

    Without context it's impossible to give a more specific solution.

  • 2020-12-17 23:12

    CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.

    What language are you using?
