Parse a csv using awk and ignoring commas inside a field

抹茶落季 2020-11-29 04:36

I have a CSV file where each row defines a room in a given building. Along with the room, each row has a floor field. What I want to extract is all floors in all buildings.

7 answers
  • 2020-11-29 04:50

    You could try this awk-based CSV parser:

    http://lorance.freeshell.org/csv/

  • 2020-11-29 04:50

    Since the problem is really to distinguish between a comma inside a CSV field and a comma that separates fields, we can replace the first kind of comma with something else so that the result is easier to parse further, i.e., something like this:

    0,"00BDF","AIRPORT TEST            "
    0,0,"BRICKER HALL<comma> JOHN W    "
    

    This gawk script (replace-comma.awk) does that:

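    BEGIN { RS = "(.)" }                 # every character acts as a record separator
    RT == "\x22" { inside++ }            # count double quotes; an odd count means we are inside a quoted field
    { if (inside % 2 && RT == ",") printf("<comma>"); else printf("%s", RT) }
    

    This uses a gawk feature that captures the actual record separator into a variable called RT. Because RS matches any single character, every character of the input is read as a separator, and as we read through them we replace each comma encountered inside quotes with <comma>. The quote is written as \x22 because recent gawk versions read at most two hex digits after \x, and RT is printed through "%s" so that a literal % in the data is not treated as a format specifier.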

    The FPAT solution fails in one special case, where a field contains both escaped quotes and a comma inside quotes, but this solution works in all cases. For example:

    $ echo '"Adams, John ""Big Foot""",1' | gawk -vFPAT='[^,]*|"[^"]*"' '{ print $1 }'
    "Adams, John "
    $ echo '"Adams, John ""Big Foot""",1' | gawk -f replace-comma.awk | gawk -F, '{ print $1; }'
    "Adams<comma> John ""Big Foot"""
    

    As a one-liner for easy copy-paste:

    gawk 'BEGIN { RS = "(.)" } RT == "\x22" { inside++ } { if (inside % 2 && RT == ",") printf("<comma>"); else printf("%s", RT) }'
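
    RT exists only in gawk. For other awk implementations, the same quote-counting idea can be sketched by walking each line character by character with substr. This is a minimal sketch, not part of the answer above: the function name replace_commas is hypothetical, and the quote state is deliberately carried across lines so that it behaves like the gawk script.

```shell
# Hypothetical helper: replace commas inside double-quoted fields with <comma>.
# Uses only POSIX awk features (no RT, no FPAT).
replace_commas() {
  awk '
    {
      out = ""
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "\"")
          inside = !inside            # toggle on every double quote
        out = out ((inside && c == ",") ? "<comma>" : c)
      }
      print out
    }' "$@"
}
```

    The output can then be split on plain commas, for example with replace_commas Buildings.csv | awk -F, '{ print $1 }'.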
    
  • 2020-11-29 04:51

    My workaround is to strip commas from the csv using:

    decommaize () {
      sed -e 's/"[^"]*"/"((&))"/g' \
          -e 's/"(("\([^",]*\),\([^",]*\)"))"/"\1\2"/g' \
          -e 's/"(("/"/g' \
          -e 's/"))"/"/g' "$1" > "$2"
    }
    

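    That is, first wrap every quoted field in "((" and "))", then replace "(("whatever,whatever"))" with "whateverwhatever", and finally change all remaining instances of "((" and "))" back to ". Note that the second substitution matches only a single comma per quoted field, so a field containing two or more commas is left unchanged.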
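
    A looping variant can remove every comma inside a quoted field, however many there are. The sketch below is an alternative to the function above, not a rewrite of it; the name strip_quoted_commas is hypothetical, and it assumes quoted fields do not span lines. Each pass deletes the first comma that follows an odd number of double quotes, and the ta branch repeats until nothing more matches.

```shell
# Hypothetical helper: delete all commas that sit inside double-quoted fields.
strip_quoted_commas() {
  sed -e ':a' \
      -e 's/^\(\([^"]*"[^"]*"\)*[^"]*"[^",]*\),/\1/' \
      -e 'ta' "$@"
}
```

    Usage would look like strip_quoted_commas Buildings.csv > Buildings.nocomma.csv, after which awk -F, sees only separator commas.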

  • 2020-11-29 04:55

    The extra output you're getting from csv.awk is from demo code. It's intended that you use the functions within the script to do the parsing and then output it how you want.

    At the end of csv.awk is the { ... } loop which demonstrates one of the functions. It's that code that's outputting the -> 2|.

    Instead of most of that, just call the parsing function and do print csv[1], csv[2].

    That part of the code would then look like:

    {
        num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
        if (num_fields < 0) {
            printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
        } else {
    #        printf "%s -> ", $0;
    #        printf "%s", num_fields;
    #        for (i = 0;i < num_fields;i++) {
    #            printf "|%s", csv[i];
    #        }
    #        printf "|\n";
            print csv[1], csv[2]
        }
    }
    

    Save it as your_script (for example).

    Do chmod +x your_script.

    And cat is unnecessary. Also, you can do sort -u instead of sort | uniq.

    Your command would then look like:

    ./your_script Buildings.csv | sort -u > floors.csv
    
  • 2020-11-29 04:57

    Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

    perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "$f[0],$f[1]" }' file

    The input line is split into array @f
    Field 1 is $f[0] since Perl starts indexing at 0

    output:

    u_floor,u_room
    0,00BDF
    0,0
    0,3
    0,5
    0,6
    0,7
    0,8
    0,9
    0,19
    0,20
    0,21
    0,25
    0,27
    0,29
    0,35
    0,45
    0,59
    0,60
    0,61
    0,63
    0,0006M
    0,0008A
    0,0008B
    0,0008C
    0,0008D
    0,0008E
    0,0008F
    0,0008G
    0,0008H
    

    I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

  • 2020-11-29 04:58
    gawk -vFPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' | sort | uniq
    

    This is an awesome GNU Awk 4 extension, where you define a field pattern instead of a field-separator pattern. Does wonders for CSV.

    ETA (thanks mitchus): To remove the surrounding quotes, use gsub("^\"|\"$","",$3); if there are more fields than just $3 to process that way, just loop through them.
    Note that this simple approach is not tolerant of malformed input, nor of some possible special characters between quotes (escaped quotes, for instance); covering all of those would go beyond the scope of a neat one-liner.
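
    The loop over fields mentioned above could look roughly like this. It is a sketch under assumptions: the floor is taken to be field 1 and the building name field 3, as in the one-liner; the function name floors_and_buildings is hypothetical; and the output is tab-separated so that commas inside building names stay unambiguous. It needs gawk 4.0 or later for FPAT.

```shell
# Hypothetical helper: print unique floor/building pairs with quotes removed.
floors_and_buildings() {
  gawk -v FPAT='[^,]*|"[^"]*"' -v OFS='\t' '
    {
      for (i = 1; i <= NF; i++)
        gsub(/^"|"$/, "", $i)   # strip one leading and one trailing quote
      print $1, $3              # assumed layout: floor, room, building
    }' "$@" | sort -u
}
```

    Called as floors_and_buildings Buildings.csv > floors.tsv, it reads the file directly, so no cat is needed.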
