Question
I have an input CSV file containing something like:
SD-32MM-1001,"100.00",4/11/2012
SD-32MM-1001,"1,000.00",4/12/2012
I need to strip the formatting from the numerical values for a downstream processing pipeline (PostgreSQL COPY).
Is there a text filter that will separate out the columns on FS without peeking inside quoted strings? Presently I get:
$ tail +2 /tmp/foo.csv|awk -F, '{print NF}'
3
4
cut similarly gives me partial values.
I have to stay on Linux.
Thanks.
Answer 1:
GNU awk can handle this; you just need to set FPAT to describe what you consider a field:
$ awk '{print NF}' FPAT="([^,]+)|(\"[^\"]+\")" file
3
3
$ awk '{print $2}' FPAT="([^,]+)|(\"[^\"]+\")" file
"100.00"
"1,000.00"
Answer 2:
Using a Perl script with a proper CSV parser (the robust solution: awk and cut are not suited to this particular need):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new()
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, "<:encoding(utf8)", "/tmp/file.csv" or die "$!";
while ( my $row = $csv->getline( $fh ) ) {
    # print line 2, last field
    $. == 2 and print $row->[-1];
}
$csv->eof or $csv->error_diag();
close $fh;
Output
4/12/2012
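To go a step further and actually normalize the numbers for COPY, the same parser can rewrite the whole file. A sketch along those lines, assuming the amount is always the second column (Text::CSV's print method re-quotes fields only where necessary, so the cleaned value comes out unquoted):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, "<:encoding(utf8)", "/tmp/file.csv" or die "$!";
while ( my $row = $csv->getline( $fh ) ) {
    $row->[1] =~ s/,//g;           # drop thousands separators in the amount column
    $csv->print( *STDOUT, $row );  # e.g. SD-32MM-1001,1000.00,4/12/2012
}
$csv->eof or $csv->error_diag();
close $fh;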
Answer 3:
The suggestions from sudo_O should work, unless your fields have double quotes inside them, which is legal in standard CSV data, e.g.:
field1,field2,"field,3","field4 ""has some quotes"" in it",field5
To handle these, you can wrap standard UNIX commands like cut, awk, etc. with a program I wrote called csvquote, like this:
csvquote /tmp/foo.csv | tail +2 | awk -F, '{print NF}'
This works by finding the commas inside quoted fields and replacing them temporarily with nonprinting characters that awk can safely handle. Then when you want to create output from the fields, the pipeline will need to restore those commas:
csvquote /tmp/foo.csv | cut -d, -f2 | csvquote -u
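Putting it together for the original question, a sketch that extracts the bare numeric values (assuming the amount is in column 2 and both the quotes and the thousands separators should be dropped), which for the sample data should print:
$ csvquote /tmp/foo.csv | tail +2 | cut -d, -f2 | csvquote -u | tr -d '",'
100.00
1000.00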
You can find the code here: https://github.com/dbro/csvquote
Source: https://stackoverflow.com/questions/16110601/how-to-get-awk-cut-etc-to-not-peek-for-fs-inside-quoted-strings