Escaping separator within double quotes, in awk

Posted by 时光毁灭记忆、已成空白 on 2019-11-26 13:44:48
Dimitre Radoulov

This is easy with GNU awk 4.0 or later, which supports FPAT:

zsh-4.3.12[t]% awk '{ 
 for (i = 0; ++i <= NF;)
   printf "field %d => %s\n", i, $i
 }' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5

Adding some comments as per OP requirement.

From the GNU awk manual, section "Defining Fields by Content":

The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field. In the case of CSV data as presented above, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant, we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"

Because + is used in both alternatives, this pattern does not match empty fields, but that can be fixed as well:

As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:

FPAT = "([^,]*)|(\"[^\"]+\")"
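As a quick check of the empty-field variant, here is a minimal sketch, assuming GNU awk is installed under the name gawk. The helper name fpat_fields and the bracketed output format are made up for illustration; only FPAT and its value come from the manual text above.

```shell
#!/bin/sh
# fpat_fields is a hypothetical helper name, not part of gawk.
# It needs GNU awk 4.0 or later, because FPAT is a gawk extension.
fpat_fields() {
  # The first alternative uses * so that an empty field still counts as a field.
  gawk -v FPAT='([^,]*)|("[^"]+")' '{
    for (i = 1; i <= NF; i++)
      printf "field %d => [%s]\n", i, $i   # brackets make empty fields visible
  }' "$@"
}
```

Piping the line a,,"c,d",e through it, for example with printf '%s\n' 'a,,"c,d",e' | fpat_fields, should report four fields, the second one empty and the third still carrying its quotes.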

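FPAT works when there are commas inside the quoted fields. It does not help with newlines inside quoted fields, because awk splits the input into records at each newline before FPAT is applied, and it fails when a quoted field contains escaped double quotes, like this: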

field1,"field,2","but this field has ""escaped"" quotes"

You can use a simple wrapper program I wrote called csvquote to make data easy for awk to interpret, and then restore the problematic special characters, like this:

csvquote inputfile.csv | awk -F, '{print $4}' | csvquote -u

See https://github.com/dbro/csvquote for code and docs
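If installing another tool is not an option, doubled quotes can also be handled in plain awk by walking each line one character at a time instead of relying on FS or FPAT. The following is only a sketch under stated assumptions: csv_field is a made-up helper name, this is not how csvquote works, and it still reads input line by line, so newlines inside quoted fields are not supported.

```shell
#!/bin/sh
# csv_field N [FILE...]: print the Nth CSV field of every input line.
# csv_field is a hypothetical helper name, not part of awk or csvquote.
csv_field() {
  n=$1
  shift
  awk -v want="$n" '
  {
    nf = 0        # number of fields completed so far
    field = ""    # text of the field being built
    inq = 0       # 1 while inside a quoted field
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (inq) {
        if (c == "\"") {
          if (substr($0, i + 1, 1) == "\"") {
            field = field "\""   # "" inside quotes is one literal quote
            i++
          } else {
            inq = 0              # closing quote
          }
        } else {
          field = field c        # commas inside quotes are kept as data
        }
      } else if (c == "\"") {
        inq = 1                  # opening quote, not part of the value
      } else if (c == ",") {
        out[++nf] = field        # unquoted comma ends the field
        field = ""
      } else {
        field = field c
      }
    }
    out[++nf] = field            # last field on the line
    if (want + 0 <= nf) print out[want + 0]; else print ""
  }' "$@"
}
```

With the sample line above as input, csv_field 3 prints: but this field has "escaped" quotes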

Chris Koknat

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

Suppose you only want to print the 4th field:

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "\"$f[3]\"" }' file

The input line is split into array @f
Field 4 is $f[3] since Perl starts indexing at 0

I provided more explanation of Text::CSV_XS in my answer to "parse csv file using gawk".
