How do I convert a tab-separated values (TSV) file to a comma-separated values (CSV) file in BASH?

闹比i 2021-02-03 14:07

I have some TSV files that I need to convert to CSV files. Is there any solution in BASH, e.g. using awk, to convert these? I could use sed, like this,

4 answers
  •  失恋的感觉
    2021-02-03 14:14

    Update: The following solutions are not generally robust, although they do work in the OP's specific use case; see the bottom section for a robust, awk-based solution.


    To summarize the options (interestingly, they all perform about the same):

    tr:

    devnull's solution (provided in a comment on the question) is the simplest:

    tr '\t' ',' < file.tsv > file.csv
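
    As a quick check of the tr approach, here is a minimal sketch that wraps it in a helper and feeds it a two-line sample. The function name tsv_to_csv_simple and the sample data are made up for illustration; the helper is only safe when no field contains a comma or a double quote.

```shell
# Hypothetical helper: reads TSV on stdin, writes CSV on stdout.
# Only safe when no field contains a comma or a double quote.
tsv_to_csv_simple() {
  tr '\t' ','
}

# Sample input built with printf so that the tabs are explicit.
printf 'id\tname\tcity\n1\tAda\tLondon\n' | tsv_to_csv_simple
# id,name,city
# 1,Ada,London
```

    Because tr maps every tab to exactly one comma, empty fields (two adjacent tabs) are preserved as two adjacent commas.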
    

    sed:

    The OP's own sed solution is perfectly fine, given that the input contains no quoted strings (with potentially embedded \t chars.):

    sed 's/\t/,/g' file.tsv > file.csv
    

    The only caveat is that some sed implementations (e.g., the BSD sed that ships with macOS) do not support the escape sequence \t, so a literal tab character must be spliced into the command string using an ANSI C-quoted string ($'\t'):

    sed 's/'$'\t''/,/g' file.tsv > file.csv
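
    If the shell in use might not support $'...' either, one possible alternative is to let printf produce the tab and keep it in a variable. This is a sketch under that assumption; the variable name tab and the function name tsv_to_csv_sed are hypothetical.

```shell
# Keep a literal tab in a variable; printf '\t' works in any POSIX shell.
tab=$(printf '\t')

# Hypothetical helper: reads TSV on stdin, writes CSV on stdout.
# Double quotes let the shell expand ${tab} before sed sees the script.
tsv_to_csv_sed() {
  sed "s/${tab}/,/g"
}

printf 'x\ty\tz\n' | tsv_to_csv_sed
# x,y,z
```

    The same limitation applies as before: fields that contain commas or quotes are not protected.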
    

    awk:

    The caveat with awk is that FS, the input field separator, must be set to \t explicitly. Otherwise, the default whitespace splitting would strip leading and trailing tabs, replace interior runs of multiple tabs with only a single , and also split fields on spaces:

    awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
    

    Note that simply assigning $1 to itself causes awk to rebuild the input line using OFS - the output field separator; this effectively replaces all \t chars. with , chars. print then simply prints the rebuilt line.
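
    To make the FS caveat concrete, the following sketch runs the same record, which has an empty second field, through awk with and without FS="\t". The helper names with_default_fs and with_tab_fs are made up for this illustration.

```shell
# Default FS: any run of blanks and tabs counts as a single separator.
with_default_fs() {
  awk 'BEGIN { OFS="," } { $1=$1; print }'
}

# FS="\t": every single tab separates two fields.
with_tab_fs() {
  awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }'
}

# The record below has an empty second field (two adjacent tabs).
printf 'a\t\tc\n' | with_default_fs   # a,c   (the empty field is lost)
printf 'a\t\tc\n' | with_tab_fs       # a,,c  (the empty field survives)
```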


    Robust awk solution:

    As A. Rabus points out, the above solutions do not correctly handle unquoted input fields that themselves contain , characters: you'll end up with extra CSV fields.

    The following awk solution fixes this, by enclosing such fields in "..." on demand (see the non-robust awk solution above for a partial explanation of the approach).

    If such fields also have embedded " chars., these are escaped as "", in line with RFC 4180. (Thanks, Wyatt Israel.)

    awk 'BEGIN { FS="\t"; OFS="," } {
      rebuilt=0
      for(i=1; i<=NF; ++i) {
        if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
          gsub("\"", "\"\"", $i)
          $i = "\"" $i "\""
          rebuilt=1 
        }
      }
      if (!rebuilt) { $1=$1 }
      print
    }' file.tsv > file.csv
    
    • $i ~ /[,"]/ && $i !~ /^".*"$/ detects any field that contains , and/or " and isn't already enclosed in double quotes

    • gsub("\"", "\"\"", $i) escapes embedded " chars. by doubling them

    • $i = "\"" $i "\"" updates the field by enclosing it in double quotes

    • As stated before, updating any field causes awk to rebuild the line from the fields with the OFS value, i.e., , in this case, which amounts to the effective TSV -> CSV conversion; flag rebuilt is used to ensure that each input record is rebuilt at least once.
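
    As a worked example of the steps in the list above, this sketch wraps the script in a function, using the /[,"]/ test described in the first bullet, and runs it on two sample records. The function name tsv_to_csv and the sample data are hypothetical.

```shell
# Hypothetical wrapper: reads TSV on stdin, writes CSV on stdout.
tsv_to_csv() {
  awk 'BEGIN { FS="\t"; OFS="," } {
    rebuilt=0
    for (i=1; i<=NF; ++i) {
      # Quote fields that contain , or " and are not already quoted.
      if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
        gsub("\"", "\"\"", $i)   # double any embedded quotes
        $i = "\"" $i "\""        # enclose the field in quotes
        rebuilt=1
      }
    }
    if (!rebuilt) { $1=$1 }      # force a rebuild with OFS
    print
  }'
}

printf '1\tSmith, John\tsaid "hi"\n2\tplain\t"already, quoted"\n' | tsv_to_csv
# 1,"Smith, John","said ""hi"""
# 2,plain,"already, quoted"
```

    In the first record, both the field with a comma and the field with embedded quotes get enclosed in double quotes. In the second record, the already quoted field is left untouched, so the line is rebuilt only through $1=$1.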
