Question
Suppose I have a text file with records of the following form, where the FS is generally speaking a comma and the RS is generally speaking a newline.
However, the exception to this rule is that if a field is in quotes, the line breaks and commas inside the quotes should be treated as part of the field.
"This field contains
line breaks and is
quoted but it
should be treated as a
single field",1,2,3,"another field"
How can I use awk to parse such a file correctly, where I can still access $1, $2, ... as I usually would, but with the above interpretation of fields?
I have already looked at this wiki page, but the solution presented there does not solve the problem of line breaks.
Answer 1:
A possible, although not perfect, solution is this: awk 'BEGIN{RS="\""}{...}'. By doing this you reset the record separator to ", while the field separator remains the default (whitespace). The problem is that this will add two empty records to your file, because the first and last " will also be matched as delimiting records.
Example:
awk 'BEGIN{RS="\""} {print $0,"END OF RECORD",$1,"-",$2}'
will produce this result when applied to your data
END OF RECORD -
This field contains
line breaks and is
quoted but it
should be treated as a
single field END OF RECORD This - field
,1,2,3, END OF RECORD ,1,2,3, -
another field END OF RECORD another - field
END OF RECORD -
You can skip the first one by adding the condition NR>1. The last one is a bit more tricky though, because you do not know in advance how many records there are in your file. You can save the values you want to print in an array and print them with a for loop in the END block, skipping the first and last record in your file.
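A minimal sketch of that idea, assuming the quotes in the data are balanced as in the sample above (the file name file is a placeholder):
awk 'BEGIN{RS="\""}
     { rec[NR] = $0 }                  # buffer every "-delimited chunk
     END{
         for (i = 2; i < NR; i++)      # skip the first and last (spurious) records
             print rec[i], "END OF RECORD"
     }' file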
Answer 2:
To get awk to parse the file correctly, you can use a program I wrote called csvquote that temporarily replaces the commas and newlines that appear inside quoted fields with nonprinting characters that won't confuse awk. This program sanitizes the data into a format where awk can rely on a comma always representing a field separator, and a newline always representing a record separator.
To use it, you wrap your pipeline involving cut/awk/... like this:
csvquote /tmp/foo.csv | tail -n +2 | awk -F, '{print $3 $2}' | csvquote -u
You can find the code here: https://github.com/dbro/csvquote
The one caveat is that if you want to search for commas and newlines inside fields, this makes that task more complicated because you would need to search for the nonprinting characters instead. If you are looking for a way to do this more easily, you should look into the csvfix tools.
Another option is to use awk's FPAT, but that won't work if the fields contain escaped quotation marks. See http://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
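For reference, a minimal gawk sketch using FPAT (the pattern follows the gawk manual's CSV example; it copes with commas inside quoted fields, but on its own it does nothing about line breaks inside fields, so input like the sample above would still need a different RS, and escaped quotes still break it):
gawk 'BEGIN{ FPAT = "([^,]+)|(\"[^\"]+\")" }   # a field is either comma-free text or a quoted string
      { for (i = 1; i <= NF; i++) print i, $i }' file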
Answer 3:
You can probably use a double newline as the record separator. If you also set comma as the field separator, then each quoted block of text, line breaks included, is handled as a single field:
awk -v RS="\n\n" -v FS="," '...' file
For your given file, let's print each field number together with the field itself:
$ awk -v RS="\n\n" -v FS="," '{for (i=1; i<=NF; i++) print i, $i}' file
1 "This field contains
line breaks and is
quoted but it
should be treated as a
single field"
2 1
3 2
4 3
5 "another field"
Source: https://stackoverflow.com/questions/16094067/is-it-possible-to-handle-fields-containing-line-breaks-in-awk