Question
Suppose I have a text file with records of the following form, where the FS is generally speaking a comma and the RS is generally speaking a newline.
However, the exception to this rule is that if a field is in quotes, the line breaks and commas inside the quotes should be treated as part of the field.
"This field contains
line breaks and is
quoted but it
should be treated as a
single field",1,2,3,"another field"
How can I use awk to parse such a file correctly, where I can still access $1, $2, ... as I usually would, but with the above interpretation of fields?
I have already looked at this wiki page, but the solution presented there does not solve the problem of line breaks.
Answer 1:
A possible, although not perfect, solution is this: awk 'BEGIN{RS="\""}{...}'. By doing this you reset the record separator to ", while the field separator remains the default (whitespace). The problem is that this will add two empty records to your file, because the first and last " will also be matched as delimiting records.
Example:
awk 'BEGIN{RS="\""} {print $0,"END OF RECORD",$1,"-",$2}'
will produce this result when applied to your data
END OF RECORD -
This field contains
line breaks and is
quoted but it
should be treated as a
single field END OF RECORD This - field
,1,2,3, END OF RECORD ,1,2,3, -
another field END OF RECORD another - field
END OF RECORD -
You can skip the first one by adding the condition NR>1. The last one is a bit more tricky though, because you do not know in advance how many records there are in your file. You can save the values you want to print in an array and print them with a for loop in the END block, skipping the first and last record in your file.
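A minimal sketch of that idea, assuming the quotes in the data are balanced as in the sample above (the file name file is a placeholder):
awk 'BEGIN{RS="\""}
     { rec[NR] = $0 }                  # buffer every "-delimited chunk
     END{
         for (i = 2; i < NR; i++)      # skip the first and last (spurious) records
             print rec[i], "END OF RECORD"
     }' file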
Answer 2:
To get awk to parse the file correctly, you can use a program I wrote called csvquote that temporarily replaces the commas and newlines that appear inside quoted fields with nonprinting characters that won't confuse awk. This program sanitizes the data into a format where awk can rely on a comma always representing a field separator, and a newline always representing a record separator.
To use it, you wrap your pipeline involving cut/awk/... like this:
csvquote /tmp/foo.csv | tail -n +2 | awk -F, '{print $3 $2}' | csvquote -u
You can find the code here: https://github.com/dbro/csvquote
The one caveat is that if you want to search for commas and newlines inside fields, this makes that task more complicated because you would need to search for the nonprinting characters instead. If you are looking for a way to do this more easily, you should look into the csvfix tools.
Another option is to use awk's FPAT, but that won't work if the fields contain escaped quotation marks. See http://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
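For reference, a minimal gawk sketch using FPAT (the pattern follows the gawk manual's CSV example; it copes with commas inside quoted fields, but on its own it does nothing about line breaks inside fields, so input like the sample above would still need a different RS, and escaped quotes still break it):
gawk 'BEGIN{ FPAT = "([^,]+)|(\"[^\"]+\")" }   # a field is either comma-free text or a quoted string
      { for (i = 1; i <= NF; i++) print i, $i }' file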
Answer 3:
You can probably use a double newline as the record separator. If you also set comma as the field separator, then each quoted block of text, line breaks included, is handled as a single field:
awk -v RS="\n\n" -v FS="," '...' file
For your given file, let's print each field number together with the field itself:
$ awk -v RS="\n\n" -v FS="," '{for (i=1; i<=NF; i++) print i, $i}' file
1 "This field contains
line breaks and is
quoted but it
should be treated as a
single field"
2 1
3 2
4 3
5 "another field"
Source: https://stackoverflow.com/questions/16094067/is-it-possible-to-handle-fields-containing-line-breaks-in-awk