Perl Regex: How to remove quotes inside quotes from CSV line

Posted by 亡梦爱人 on 2019-12-10 18:34:47

Question


I've got a line from a CSV file, held as a string, with " as the field encloser and , as the field separator. Sometimes the data itself contains " characters that break the field enclosers. I'm looking for a regex to remove these stray " characters.

My string looks like this:

my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;

I've looked at this but I don't understand how to tell it to only remove quotes that are

  1. not at the beginning of the string
  2. not at the end of the string
  3. not preceded by a ,
  4. not followed by a ,

I managed to handle conditions 3 and 4 at the same time with this line of code:

$csv =~ s/(?<!,)"(?!,)//g;

However, I cannot fit the ^ and $ in there since the lookahead and lookbehind both do not like being written as (?<!(^|,)).

Is there a way to achieve this only with a regex besides splitting the string up and removing the quote from each element?


Answer 1:


This should work:

$csv =~ s/(?<=[^,])"(?=[^,])//g;

Conditions 1 and 2 imply that there must be at least one character before and after the quote, hence the positive lookarounds. Conditions 3 and 4 imply that those characters can be anything but a comma.
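A quick way to check this is to run the substitution on the sample line from the question. The sketch below is only an illustration: the helper names strip_inner_quotes and strip_inner_quotes_stacked are made up for it. The second helper spells out the four conditions as separate negative lookarounds, which is one possible way around the (?<!(^|,)) form that the question ran into.

```perl
use strict;
use warnings;

# Hypothetical helper: the positive-lookaround substitution from this answer.
sub strip_inner_quotes {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,])//g;
    return $line;
}

# Hypothetical helper: the four conditions written as stacked negative
# lookarounds (not at start, not after a comma, not before a comma, not at end).
sub strip_inner_quotes_stacked {
    my ($line) = @_;
    $line =~ s/(?<!^)(?<!,)"(?!,)(?!$)//g;
    return $line;
}

my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;

print strip_inner_quotes($csv), "\n";
print strip_inner_quotes_stacked($csv), "\n";
```

For this input both calls should print the line with only the quote after 28 removed; the enclosing quotes and the empty "" fields are left alone.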




Answer 2:


For manipulating CSV data I'd recommend using Text::CSV. There's a lot of potential complexity within CSV data; while it is possible to construct code to handle it yourself, that isn't worth the effort when there's a tried and tested CPAN module to do it for you.




Answer 3:


Don't use a regex to parse a CSV file. CPAN provides plenty of good modules: as nickifat suggests, use Text::CSV, or you can use Text::ParseWords like this:

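use strict;
use warnings;
use Text::ParseWords;

while (<DATA>) {
    chomp;
    # Split on commas; a keep argument of 0 strips the quotes from each field.
    my @f = quotewords(',', 0, $_);
    print join('|', @f);
}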

__DATA__ 
"123456","024003","Stuff","",""28" stuff with more stuff","2"," 1.99 ","","" 

Output:

123456|024003|Stuff||28 stuff with more stuff|2| 1.99 || 



Answer 4:


Thanks for the help here. I was having issues with badly formatted CSV containing embedded double quotes. I would make one slight addition to the lookahead portion of the regex; otherwise, empty values at the end of the line will be corrupted:

(?<=[^,])\"(?=[^,\n])

Adding the \n will eliminate a match against the last double-quote at end-of-line.
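To see the difference, both patterns can be compared on a line that still carries its trailing newline, as it would when read from a file without chomp. This is a minimal sketch; the helper names and the shortened sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: original lookahead, which accepts "\n" as a non-comma.
sub strip_original {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,])//g;
    return $line;
}

# Hypothetical helper: lookahead amended so a following newline does not count.
sub strip_newline_safe {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,\n])//g;
    return $line;
}

# Shortened, made-up line ending in an empty field and a newline.
my $line = qq~"123456","28" stuff","",""\n~;

print strip_original($line);       # the final closing quote is removed as well
print strip_newline_safe($line);   # only the stray quote after 28 is removed
```

An alternative is simply to chomp the line before applying the original pattern.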




Answer 5:


The suggested

$csv =~ s/(?<=[^,])"(?=[^,])//g;

is probably the best answer. Without these advanced regex features, you could also do the same with

$csv =~ s/([^,])"([^,])/$1$2/g;

or

$csv = join (',', map {s/"//g;"\"$_\""} split (',', $csv));
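One caveat with the capture-group version, shown here as a hedged sketch: because it consumes the characters on both sides of the quote, two stray quotes separated by a single character are not both removed in one pass, whereas the lookaround version handles them. The input below is a made-up example, not data from the question.

```perl
use strict;
use warnings;

# Made-up input: two stray quotes with a single character between them.
my $input = q~"x"y"z","2"~;

# Lookarounds are zero-width, so neighbouring matches can share a character.
(my $with_lookaround = $input) =~ s/(?<=[^,])"(?=[^,])//g;

# Capture groups consume the neighbours, so the "y" cannot serve both matches.
(my $with_captures = $input) =~ s/([^,])"([^,])/$1$2/g;

print "$with_lookaround\n";   # "xyz","2"
print "$with_captures\n";     # "xy"z","2"
```

For the sample line in the question, which has only one stray quote, both versions give the same result.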

I think you should be aware that your string is not well-formatted CSV. In a CSV file, double quotes inside values must be doubled (http://en.wikipedia.org/wiki/Comma-separated_values). With your format, values cannot contain quotes next to commas.
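If the goal is to keep the quote and produce conventional CSV rather than drop it, the same lookaround could double the stray quote instead of removing it. This is a minimal sketch under the question's own assumption that stray quotes never sit next to a comma; escape_inner_quotes is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: double each stray inner quote instead of removing it.
# Assumes the input has NOT been escaped already; running it on its own
# output would double the quotes a second time.
sub escape_inner_quotes {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,\n])/""/g;
    return $line;
}

my $raw = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;
print escape_inner_quotes($raw), "\n";
```

The fifth field then reads "28"" stuff with more stuff", which a conforming CSV parser should decode as 28" stuff with more stuff.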

CSV is not as simple a format as it looks. If you decide to use "real" CSV, you should use a module. Otherwise, you should probably remove all the double quotes in order to simplify your code and make it clear that you are not producing CSV.



Source: https://stackoverflow.com/questions/10446583/perl-regex-how-to-remove-quotes-inside-quotes-from-csv-line
