Fastest way to convert a tab-delimited file to CSV in Linux

感情败类 2020-12-04 07:56

I have a tab-delimited file that has over 200 million lines. What's the fastest way in Linux to convert this to a CSV file? This file does have multiple lines of header i

11 answers
  • 2020-12-04 08:21

    If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:

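    import sys
    import csv

    # Parse stdin as tab-delimited and write comma-delimited rows to stdout.
    tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
    commaout = csv.writer(sys.stdout, dialect=csv.excel)
    for row in tabin:
        # The writer quotes any field that contains a comma or a quote.
        commaout.writerow(row)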
    

    Run it from a shell as follows:

    python script.py < input.tsv > output.csv
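
    To see what the csv module does with an embedded comma, here is a minimal in-memory sketch of the same conversion. The helper name tsv_to_csv and the sample data are made up for illustration; the reader and writer dialects are the ones used in the script above.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert TSV text to CSV text (hypothetical helper for illustration)."""
    src = io.StringIO(tsv_text, newline="")
    dst = io.StringIO(newline="")
    reader = csv.reader(src, dialect=csv.excel_tab)
    writer = csv.writer(dst, dialect=csv.excel)
    for row in reader:
        writer.writerow(row)
    return dst.getvalue()


# Illustrative sample: the second field of the data row contains a comma.
sample = "name\tnote\nAda\thello, world\n"
print(repr(tsv_to_csv(sample)))
# 'name,note\r\nAda,"hello, world"\r\n'
```

    The field with the comma comes out quoted, so the column count is preserved. Note that the excel dialect terminates rows with \r\n.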
    
  • 2020-12-04 08:25
    sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile
    

    Damn the critics, quote everything, CSV doesn't care.

    <tab> is the actual tab character. \t didn't work for me. In bash, press Ctrl-V and then Tab to enter it.
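
    The quote-everything idea can also be sketched with Python's csv module. This is only an illustration: the helper name quote_all and the sample row are assumptions, the input is treated as raw text with no quoting of its own, and embedded quotes are doubled, which is the csv module's default behaviour.

```python
import csv
import io


def quote_all(tsv_text):
    """Wrap every field in double quotes (hypothetical helper)."""
    src = io.StringIO(tsv_text, newline="")
    dst = io.StringIO(newline="")
    # QUOTE_NONE: treat quote characters in the input as ordinary text.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # QUOTE_ALL: quote every output field; embedded quotes are doubled.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(reader)
    return dst.getvalue()


print(quote_all('id\tsaid\n1\tshe said "hi"\n'))
# "id","said"
# "1","she said ""hi"""
```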

  • 2020-12-04 08:26

    If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.

    The blank space here is a literal tab:

    $ echo "hello   world" | tr "\\t" ","
    hello,world
    

    Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
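
    A small sketch of that failure mode, assuming made-up sample rows: naive_translate stands in for tr, and parse is a hypothetical helper that reads one line back with the csv module. It also shows the related problem of a comma that is already present in a field.

```python
import csv
import io


def naive_translate(text):
    """Rough equivalent of tr '\\t' ',': replaces every tab, quoted or not."""
    return text.replace("\t", ",")


def parse(text, delimiter):
    """Parse one delimited line into a list of fields with the csv module."""
    return next(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


quoted_tab = 'id\t"a\tb"\n'   # second field is quoted and holds a literal tab
bare_comma = "id\t1,5\n"       # second field holds a comma

print(parse(quoted_tab, "\t"))                   # ['id', 'a\tb']
print(parse(naive_translate(quoted_tab), ","))   # ['id', 'a,b']
print(parse(naive_translate(bare_comma), ","))   # ['id', '1', '5']
```

    In the second case the embedded tab silently becomes a comma; in the third, a two-column row turns into three columns.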

  • 2020-12-04 08:28
    perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv
    

    Perl is generally faster at this sort of thing than sed, awk, or Python.

  • 2020-12-04 08:32

    You can also use xsv for this

    xsv input -d '\t' input.tsv > output.csv
    

    In my test on a 300MB tsv file, it was roughly 5x faster than the python solution (2.5s vs. 14s).

  • 2020-12-04 08:32

    Right-click the file, click rename, delete the 't' and put a 'c'. I'm actually not joking: most CSV parsers can handle tab delimiters. I just ran into this issue, and for my purposes renaming worked just fine.
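
    Whether renaming is enough depends on the consumer either detecting the delimiter or letting you set it. As one example, Python's csv module does not auto-detect by default, but it reads tab-delimited data directly when told the delimiter. The helper name read_delimited and the sample data below are assumptions for illustration.

```python
import csv
import io


def read_delimited(stream, delimiter="\t"):
    """Read all rows from a delimited text stream (hypothetical helper)."""
    return list(csv.reader(stream, delimiter=delimiter))


# Illustrative tab-delimited content; no conversion step is needed.
sample = io.StringIO("id\tname\n1\tAda\n", newline="")
print(read_delimited(sample))
# [['id', 'name'], ['1', 'Ada']]
```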
