Parse a CSV using awk, ignoring commas inside a field

I have a CSV file where each row defines a room in a given building. Along with the room, each row has a floor field. What I want to extract is all floors in all buildings.

7 answers

终归单人心

2020-11-29 04:50

You could try this awk-based CSV parser:

http://lorance.freeshell.org/csv/

不知归路

2020-11-29 04:50
Since the real problem is to distinguish a comma inside a CSV field from one that separates fields, we can replace the first kind of comma with something else so that the file is easier to parse afterwards, i.e., something like this:
```
0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL<comma> JOHN W    "
```
This gawk script (replace-comma.awk) does that:
```
BEGIN { RS = "(.)" }
RT == "\"" { inside++ }
{ if (inside % 2 && RT == ",") printf("<comma>"); else printf("%s", RT) }
```
This relies on a gawk feature that stores the text matched by the record separator in a variable called RT. Setting RS to a regex that matches any single character turns every character into a record terminator, so RT holds one input character at a time. The script counts double quotes as it reads; while the count is odd we are inside a quoted field, and any comma seen there is replaced with <comma>.

The FPAT solution fails in one special case, where a quoted field contains both escaped quotes and a comma, whereas this solution handles it. For example:
```
$ echo '"Adams, John ""Big Foot""",1' | gawk -vFPAT='[^,]*|"[^"]*"' '{ print $1 }'
"Adams, John "
$ echo '"Adams, John ""Big Foot""",1' | gawk -f replace-comma.awk | gawk -F, '{ print $1; }'
"Adams<comma> John ""Big Foot"""
```
As a one-liner for easy copy-paste:
```
gawk 'BEGIN { RS = "(.)" } RT == "\"" { inside++ } { if (inside % 2 && RT == ",") printf("<comma>"); else printf("%s", RT) }'
```
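As a rough sketch of how the masked output might be used end to end: mask the commas, split on plain commas, then put the commas back in the field you keep. The function name `first_field` is hypothetical, and the sketch assumes that gawk is installed and that the literal text `<comma>` never occurs in the data.

```shell
# Hypothetical helper: print the first CSV field of each line read from stdin,
# keeping commas that sit inside double-quoted fields.
# Assumes gawk (RT is a gawk extension) and that "<comma>" is not in the data.
first_field() {
  gawk 'BEGIN { RS = "(.)" }
        RT == "\"" { inside++ }
        { if (inside % 2 && RT == ",") printf("<comma>"); else printf("%s", RT) }' |
    gawk -F, '{ gsub(/<comma>/, ",", $1); print $1 }'   # restore masked commas
}
```

Usage would look like `first_field < Buildings.csv`. The surrounding quotes are left in place, so they still need stripping if you want the bare value.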
不知归路

2020-11-29 04:51
My workaround is to strip commas from the csv using:
```
decommaize () {
  sed 's/"[^"]*"/"((&))"/g' "$1" \
    | sed 's/\("(("\)\([^",]*\)\(,\)\([^",]*\)\("))"\)/"\2\4"/g' \
    | sed 's/"(("/"/g; s/"))"/"/g' > "$2"
}
```
That is, first substitute opening quotes with "((" and closing quotes with "))", then substitute "(("whatever,whatever"))" with "whateverwhatever", then change all remaining instances of "((" and "))" back to ".
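Note that the pipeline above only changes quoted fields that contain exactly one comma. A possible alternative, which is a sketch and not part of the original answer, is to split each line on the double quote: the even-numbered pieces are then the quoted contents, and commas can be deleted only there. The function name `decommaize_all` is hypothetical, and the sketch assumes that no quoted field spans several lines.

```shell
# Hypothetical variant of decommaize: remove every comma inside quoted fields.
# With '"' as the field separator, even-numbered awk fields lie inside quotes.
decommaize_all() {
  awk -F'"' -v OFS='"' '{
    for (i = 2; i <= NF; i += 2) gsub(/,/, "", $i)   # strip commas in quoted parts
    print
  }' "$1" > "$2"
}
```

It takes the same arguments as before, e.g. `decommaize_all Buildings.csv cleaned.csv`. Because only the parity of the quote count matters, doubled quotes (`""`) inside a field should not confuse it.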
星月不相逢

2020-11-29 04:55
The extra output you're getting from csv.awk is from demo code. It's intended that you use the functions within the script to do the parsing and then output it how you want.

At the end of csv.awk is the { ... } loop which demonstrates one of the functions. It's that code that's outputting the -> 2|.

Instead of most of that, just call the parsing function and do print csv[1], csv[2].

That part of the code would then look like:
```
{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    if (num_fields < 0) {
        printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
    } else {
#        printf "%s -> ", $0;
#        printf "%s", num_fields;
#        for (i = 0;i < num_fields;i++) {
#            printf "|%s", csv[i];
#        }
#        printf "|\n";
        print csv[1], csv[2]
    }
}
```
Save it as your_script (for example).

Do chmod +x your_script.

And cat is unnecessary. Also, you can do sort -u instead of sort | uniq.

Your command would then look like:
```
./your_script Buildings.csv | sort -u > floors.csv
```
滥情空心

2020-11-29 04:57
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "$f[0],$f[1]" }' file

The input line is split into the array @f.
Field 1 is $f[0], since Perl starts indexing at 0.

Output:
```
u_floor,u_room
0,00BDF
0,0
0,3
0,5
0,6
0,7
0,8
0,9
0,19
0,20
0,21
0,25
0,27
0,29
0,35
0,45
0,59
0,60
0,61
0,63
0,0006M
0,0008A
0,0008B
0,0008C
0,0008D
0,0008E
0,0008F
0,0008G
0,0008H
```
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk
闹比i

2020-11-29 04:58
```
gawk -vFPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' | sort | uniq
```
This is an awesome GNU Awk 4 extension, where you define a field pattern instead of a field-separator pattern. Does wonders for CSV.

ETA (thanks mitchus): To remove the surrounding quotes, use gsub("^\"|\"$","",$3); if there are more fields than just $3 to process that way, just loop through them.
Note that this simple approach is not tolerant of malformed input, nor of some possible special characters between quotes; covering all of those would go beyond the scope of a neat one-liner.
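The loop mentioned above might look like the following sketch. The function name `fields_1_and_3` is hypothetical; the FPAT value is the one from this answer, and FPAT itself requires GNU Awk 4 or later.

```shell
# Hypothetical helper: print fields 1 and 3 of CSV read from stdin, without
# the surrounding quotes. Assumes gawk 4+ (FPAT is a gawk extension).
fields_1_and_3() {
  gawk -v FPAT='[^,]*|"[^"]*"' '{
    for (i = 1; i <= NF; i++) gsub(/^"|"$/, "", $i)   # drop enclosing quotes
    print $1 "," $3
  }'
}
```

It can replace the gawk stage of the pipeline, e.g. `fields_1_and_3 < Buildings.csv | sort -u`. Once the quotes are gone, a comma inside field 3 is no longer distinguishable from a separator in the output, so pick a different output delimiter if that matters.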