Question
I've got a line from a CSV file, as a string, with " as the field encloser and , as the field separator. Sometimes there are " characters in the data that break the field enclosers. I'm looking for a regex to remove these " characters.
My string looks like this:
my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;
I've looked at this but I don't understand how to tell it to only remove quotes that are
- not at the beginning of the string
- not at the end of the string
- not preceded by a ,
- not followed by a ,
I managed to cover conditions 3 and 4 (the last two in the list) at the same time with this line of code:
$csv =~ s/(?<!,)"(?!,)//g;
However, I cannot fit the ^ and $ in there, since neither the lookahead nor the lookbehind likes being written as (?<!(^|,)).
Is there a way to achieve this only with a regex besides splitting the string up and removing the quote from each element?
Answer 1:
This should work:
$csv =~ s/(?<=[^,])"(?=[^,])//g
Conditions 1 and 2 imply that there must be at least one character before and after the quote, hence the positive lookarounds. Conditions 3 and 4 imply that these characters can be anything but a comma.
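As a quick check, the substitution above can be applied to the string from the question. This is only a minimal sketch: the helper name strip_inner_quotes is made up for illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from this answer.
sub strip_inner_quotes {
    my ($line) = @_;
    # Remove a quote only if a non-comma character sits on both sides of it.
    # A quote at the very start or end has no character on one side,
    # so the positive lookarounds leave it alone.
    $line =~ s/(?<=[^,])"(?=[^,])//g;
    return $line;
}

my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;
print strip_inner_quotes($csv), "\n";
# prints: "123456","024003","Stuff","","28 stuff with more stuff","2"," 1.99 ","",""
```

Note that the empty fields ("") survive because each of their quotes has a comma, or the string boundary, on at least one side.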
Answer 2:
For manipulating CSV data I'd recommend using Text::CSV. There's a lot of potential complexity within CSV data, and while it's possible to construct code to handle it yourself, that isn't worth the effort when there's a tried and tested CPAN module to do it for you.
Answer 3:
Don't use a regex to parse a CSV file. CPAN provides a lot of good modules: as nickifat suggests, use Text::CSV, or you can use Text::ParseWords like this:
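use strict;
use warnings;
use Text::ParseWords;

while (my $line = <DATA>) {
    chomp $line;
    # keep = 0: quotes are stripped and adjacent quoted/unquoted parts are joined into one field
    my @f = quotewords(',', 0, $line);
    print join('|', @f), "\n";
}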
__DATA__
"123456","024003","Stuff","",""28" stuff with more stuff","2"," 1.99 ","",""
Output:
123456|024003|Stuff||28 stuff with more stuff|2| 1.99 ||
Answer 4:
Thanks for the help here. I was having issues with badly formatted CSV with embedded double quotes. I would make one slight addition to the lookahead portion of the regex; otherwise, null values at the end of the line will be corrupted:
(?<=[^,])\"(?=[^,\n])
Adding the \n will eliminate a match against the last double-quote at end-of-line.
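The difference only shows up when the line still carries its trailing newline. The following sketch compares the two patterns on such a line; the helper names strip_original and strip_with_nl are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers: the original lookahead versus the one extended with \n.
sub strip_original {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,])//g;
    return $line;
}

sub strip_with_nl {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,\n])//g;
    return $line;
}

# A line as read from a file without chomp: it ends in an empty field plus "\n".
my $line = qq{"2"," 1.99 ","",""\n};

print strip_original($line);   # closing quote of the last field is lost: "2"," 1.99 ","","
print strip_with_nl($line);    # line is left intact:                     "2"," 1.99 ","",""
```

Assuming plain \n line endings, calling chomp on the line before the substitution should avoid the problem as well, since the final quote then has no character after it.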
Answer 5:
The suggested
$csv =~ s/(?<=[^,])"(?=[^,])//g;
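is probably the best answer. Without these advanced regex features, you could do almost the same with the following (the neighbouring characters are consumed by the match, so two stray quotes separated by a single character would need a second pass):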
$csv =~ s/([^,])"([^,])/$1$2/g;
or
$csv = join (',', map {s/"//g;"\"$_\""} split (',', $csv));
I think you should be aware that your string is not well-formed CSV. In a CSV file, double quotes inside values must be doubled (http://en.wikipedia.org/wiki/Comma-separated_values). With your format, values cannot contain quotes next to commas.
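If the goal is to end up with well-formed CSV rather than to drop the stray quotes, the same lookaround pattern could double them instead. This is only a sketch under that assumption; the helper name double_inner_quotes is hypothetical, and a quote that sits next to a comma inside a value is still not detected.

```perl
use strict;
use warnings;

# Hypothetical helper: escape stray quotes by doubling them instead of deleting them.
sub double_inner_quotes {
    my ($line) = @_;
    # Same condition as before: a non-comma character on both sides of the quote.
    $line =~ s/(?<=[^,])"(?=[^,])/""/g;
    return $line;
}

my $csv = qq~"Stuff","","28" stuff with more stuff","2"~;
print double_inner_quotes($csv), "\n";
# prints: "Stuff","","28"" stuff with more stuff","2"
```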
CSV is not such a simple format. If you decide to use "real" CSV, you should use a module. Otherwise, you should probably remove all the double quotes, both to simplify your code and to make it clear that you are not doing CSV.
Source: https://stackoverflow.com/questions/10446583/perl-regex-how-to-remove-quotes-inside-quotes-from-csv-line