What's the most robust way to efficiently parse CSV using awk?

花落未央 2020-11-21 06:16

The intent of this question is to provide a canonical answer.

Given a CSV as might be generated by Excel or other tools, with embedded newlines and embedded double quotes inside quoted fields.

2 Answers
  •  天涯浪人
    2020-11-21 06:52

    An improvement upon @EdMorton's FPAT solution, which should be able to handle double quotes (") escaped by doubling ("") -- as allowed by the CSV standard.

    gawk -v FPAT='[^,]*|("[^"]*")+' ...
    

    This STILL

    1. isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files (a workaround is sketched at the end of this answer).

    2. assumes GNU awk (gawk); a standard awk won't do.

    Example:

    $ echo 'a,,"","y""ck","""x,y,z"," ",12' |
    gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
    a||""|"y""ck"|"""x,y,z"|" "|12
    
    $ echo 'a,,"","y""ck","""x,y,z"," ",12' |
    gawk -v FPAT='[^,]*|("[^"]*")+' '{
      for(i=1; i<=NF;i++){
        if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
        print "<"$i">"
      }
    }'
    
    <a>
    <>
    <>
    <y"ck>
    <"x,y,z>
    < >
    <12>
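
    One possible workaround for limitation 1, as a minimal and lightly tested sketch (it assumes quotes inside fields are always doubled, per the CSV standard): because assigning to $0 makes gawk re-split the record with FPAT, you can buffer physical lines until the number of double quotes in the buffer is even -- i.e. until every quoted field has been closed -- and only then treat the buffer as one logical record.

    $ printf 'a,"hello\nworld",b\n' |
    gawk -v FPAT='[^,]*|("[^"]*")+' '{
      # Append the current physical line to the pending logical record.
      rec = (rec == "" ? $0 : rec "\n" $0)
      # An odd quote count means a quoted field is still open, so keep
      # reading; gsub(/"/, "&", rec) leaves rec unchanged and returns
      # how many quotes it found.
      if (gsub(/"/, "&", rec) % 2) next
      # The record is complete: assigning it to $0 re-splits it with FPAT,
      # so the embedded newline stays inside its quoted field.
      $0 = rec; rec = ""
      for (i = 1; i <= NF; i++) print "<" $i ">"
    }'
    <a>
    <"hello
    world">
    <b>

    Stripping the surrounding quotes and un-doubling "" can then be done per field exactly as in the second example above; note that unbalanced quotes at end of input would leave the last record buffered and never printed.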
    
