How do I convert a tab-separated values (TSV) file to a comma-separated values (CSV) file in BASH?

闹比i 2021-02-03 14:07

I have some TSV files that I need to convert to CSV files. Is there any solution in BASH, e.g. using awk, to convert these? I could use sed, like this,

4 answers
  • 2021-02-03 14:14

    Update: The following solutions are not generally robust, although they do work in the OP's specific use case; see the bottom section for a robust, awk-based solution.


    To summarize the options (interestingly, they all perform about the same):

    tr:

    devnull's solution (provided in a comment on the question) is the simplest:

    tr '\t' ',' < file.tsv > file.csv
    

    sed:

    The OP's own sed solution is perfectly fine, given that the input contains no quoted strings (with potentially embedded \t chars.):

    sed 's/\t/,/g' file.tsv > file.csv
    

    The only caveat is that some sed implementations (e.g., the BSD sed that ships with macOS) do not support the escape sequence \t, so a literal tab char. must be spliced into the command string using an ANSI C-quoted string ($'\t'):

    sed 's/'$'\t''/,/g' file.tsv > file.csv
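
    If the shell in use does not support $'...' either, one alternative is to capture a literal tab in a variable first. The following is a minimal sketch under that assumption; the file names demo.tsv and demo.csv are made up for illustration:

```shell
# Hypothetical demo input; the file names are assumptions for illustration.
printf 'a\tb\tc\n1\t2\t3\n' > demo.tsv

# Capture a literal tab character; printf understands \t in any POSIX shell.
tab=$(printf '\t')

# Double quotes let the shell expand $tab before sed sees the script,
# so sed receives a real tab rather than the escape sequence \t.
sed "s/${tab}/,/g" demo.tsv > demo.csv

cat demo.csv
# prints:
# a,b,c
# 1,2,3
```

    This avoids depending on either sed's or the shell's handling of \t inside the sed script.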
    

    awk:

    The caveat with awk is that FS - the input field separator - must be set to \t explicitly; with the default FS, awk splits on any run of whitespace, so it would strip leading and trailing tabs, replace interior spans of multiple tabs with only a single , and also split fields at embedded spaces:

    awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
    

    Note that simply assigning $1 to itself causes awk to rebuild the input line using OFS - the output field separator; this effectively replaces all \t chars. with , chars. print then simply prints the rebuilt line.
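
    To see why the explicit FS matters, here is a small sketch that runs the same rebuild with and without it on a record that has an empty middle field. The file name empty_field.tsv and the variable names are assumptions made for this illustration:

```shell
# Sample record with an empty second field (two consecutive tabs).
printf 'a\t\tc\n' > empty_field.tsv

# Default FS: runs of whitespace collapse, so the empty field is lost.
default_fs=$(awk 'BEGIN { OFS="," } { $1=$1; print }' empty_field.tsv)

# FS="\t": every single tab is a separator, so the empty field survives.
tab_fs=$(awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }' empty_field.tsv)

echo "default FS: $default_fs"   # default FS: a,c
echo "tab FS:     $tab_fs"       # tab FS:     a,,c
```

    With the default FS the column count changes silently, which is usually not what you want in a CSV file.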


    Robust awk solution:

    As A. Rabus points out, the above solutions do not correctly handle unquoted input fields that themselves contain , characters: you'll end up with extra CSV fields.
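
    The problem is easy to reproduce. The sketch below uses a made-up two-field record (the sample data and variable names are assumptions) and counts how many fields a simple comma-splitting reader would see after a naive conversion:

```shell
# A TSV record with two fields; the first one contains a comma.
record=$(printf 'Smith, John\t42')

# Naive conversion: every tab becomes a comma, nothing gets quoted.
naive=$(printf '%s\n' "$record" | tr '\t' ',')
echo "$naive"                      # Smith, John,42

# Count the fields a simple comma-splitting CSV reader would see.
field_count=$(printf '%s\n' "$naive" | awk -F',' '{ print NF }')
echo "fields seen: $field_count"   # fields seen: 3
```

    Two input fields have turned into three output fields.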

    The following awk solution fixes this, by enclosing such fields in "..." on demand (see the non-robust awk solution above for a partial explanation of the approach).

    If such fields also have embedded " chars., these are escaped as "", in line with RFC 4180. (Thanks, Wyatt Israel.)

    awk 'BEGIN { FS="\t"; OFS="," } {
      rebuilt=0
      for(i=1; i<=NF; ++i) {
        if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
          gsub("\"", "\"\"", $i)
          $i = "\"" $i "\""
          rebuilt=1 
        }
      }
      if (!rebuilt) { $1=$1 }
      print
    }' file.tsv > file.csv
    
    • $i ~ /[,"]/ && $i !~ /^".*"$/ detects any field that contains , and/or " and isn't already enclosed in double quotes

    • gsub("\"", "\"\"", $i) escapes embedded " chars. by doubling them

    • $i = "\"" $i "\"" updates the result by enclosing it in double quotes

    • As stated before, updating any field causes awk to rebuild the line from the fields with the OFS value, i.e., , in this case, which amounts to the effective TSV -> CSV conversion; flag rebuilt is used to ensure that each input record is rebuilt at least once.
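
    As a quick check of the rules listed above, the script can be wrapped in a shell function and fed a sample record. This is only a sketch: the function name tsv2csv and the sample data are assumptions for illustration, and the detection pattern is the /[,"]/ form from the first bullet:

```shell
# tsv2csv is a hypothetical wrapper name, not part of the original answer.
tsv2csv() {
  awk 'BEGIN { FS="\t"; OFS="," } {
    rebuilt=0
    for (i=1; i<=NF; ++i) {
      if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
        gsub("\"", "\"\"", $i)   # double any embedded double quotes
        $i = "\"" $i "\""        # enclose the field in double quotes
        rebuilt=1
      }
    }
    if (!rebuilt) { $1=$1 }      # force a rebuild so OFS is applied
    print
  }' "$@"
}

# One record: a plain field, a field with a comma, a field with quotes.
printf 'plain\tSmith, John\tsaid "hi"\n' | tsv2csv
# prints: plain,"Smith, John","said ""hi"""
```

    Records without any special characters, including ones with empty fields, still pass through the $1=$1 rebuild and come out as plain comma-separated values.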

  • 2021-02-03 14:22

    Using awk works for me.

    Converting TSV to CSV:

    awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
    

    Or converting CSV to TSV (safe only when no field contains a quoted comma):

    awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv
    
  • 2021-02-03 14:31

    The tr command:

    tr '\t' ',' < file.tsv > file.csv
    

    is simple and gave absolutely correct and very quick results for me even on a really large file (approx 10 GB).

  • 2021-02-03 14:41

    This can also be achieved with Perl:

    In order to pipe the results to a new output file you can use the following:
    perl -wnlp -e 's/\t/,/g;' input_file.tsv > output_file.csv

    If you'd like to edit the file in place, you can invoke the -i option:
    perl -wnlpi -e 's/\t/,/g;' input_file.txt

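    If by some chance you find that what you are dealing with is not actually tabs, but instead runs of spaces, you can use the following to replace each run of one or more whitespace characters with a comma (note that \s+ also matches a single space, so fields that contain spaces will be split as well):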
    perl -wnlpi -e 's/\s+/,/g;' input_file

    Keep in mind that \s matches any whitespace character, including spaces, tabs, and newlines, and that it is meaningful only in the pattern; it cannot be used in the replacement string.
