I have a tab-delimited file that has over 200 million lines. What's the fastest way in Linux to convert this to a CSV file? This file does have multiple lines of header i
If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:
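import sys
import csv

# Parse tab-separated rows from stdin and write them as CSV to stdout;
# the writer quotes any field that contains a comma or a quote.
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
    commaout.writerow(row)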
Run it from a shell as follows:
python script.py < input.tsv > output.csv
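To see what the csv module does with an embedded comma, here is a minimal sketch of the same conversion wrapped in a function and run on an in-memory sample. The function name tsv_to_csv and the sample rows are made up for illustration; they are not part of the script above.

```python
import csv
import io


def tsv_to_csv(src, dst):
    """Copy rows from a TSV text stream to a CSV text stream."""
    reader = csv.reader(src, dialect=csv.excel_tab)
    writer = csv.writer(dst, dialect=csv.excel)
    for row in reader:
        writer.writerow(row)


# Hypothetical sample: the second field of the data row contains a comma.
sample = io.StringIO("name\tnote\nAda\thello, world\n")
result = io.StringIO()
tsv_to_csv(sample, result)
print(result.getvalue())
```

The field with the comma comes out wrapped in double quotes, so a CSV parser still sees two columns. Note that the excel dialect ends each row with \r\n.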
sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile
Damn the critics, quote everything, CSV doesn't care.
<tab> is the actual tab character; \t didn't work for me. In bash, press Ctrl-V and then Tab to enter it.
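For comparison, the same quote-everything idea can be sketched with Python's csv module using csv.QUOTE_ALL. This is not a translation of the sed command: the csv writer doubles embedded quotes ("") instead of backslash-escaping them. The function name tsv_to_quoted_csv and the sample line are assumptions made for illustration.

```python
import csv
import io


def tsv_to_quoted_csv(src, dst):
    """Split each line on tabs and write every field wrapped in quotes."""
    # QUOTE_NONE makes the reader treat quote characters in the input as
    # ordinary data, which is how the sed one-liner sees them too.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Hypothetical sample: the first field contains double quotes.
sample = io.StringIO('say "hi"\t42\n')
quoted = io.StringIO()
tsv_to_quoted_csv(sample, quoted)
print(quoted.getvalue())
```

Every field is quoted regardless of content, so embedded commas are safe, at the cost of a larger output file.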
If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.
Here printf emits a literal tab between the two words:
$ printf 'hello\tworld\n' | tr "\\t" ","
hello,world
Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
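The same byte-for-byte translation can be sketched in Python, which also makes the limitation easy to demonstrate. The function name translate_tabs and the 1 MiB chunk size are arbitrary choices for this illustration, and no claim is made here about its speed relative to tr.

```python
import io

# Translation table that maps the tab byte to a comma, like tr does.
TAB_TO_COMMA = bytes.maketrans(b"\t", b",")


def translate_tabs(src, dst, chunk_size=1 << 20):
    """Copy a binary stream, replacing every tab byte with a comma."""
    for chunk in iter(lambda: src.read(chunk_size), b""):
        dst.write(chunk.translate(TAB_TO_COMMA))


plain = io.BytesIO()
translate_tabs(io.BytesIO(b"hello\tworld\n"), plain)
print(plain.getvalue())

# A field that already contains a comma becomes ambiguous: the two input
# fields "a,b" and "c" turn into what looks like three CSV fields.
ambiguous = io.BytesIO()
translate_tabs(io.BytesIO(b"a,b\tc\n"), ambiguous)
print(ambiguous.getvalue())
```

Because nothing is parsed or quoted, this is only safe when the data contains no commas, quotes, or tabs inside fields.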
perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv
Perl is generally faster at this sort of thing than sed, awk, or Python.
You can also use xsv for this:
xsv input -d '\t' input.tsv > output.csv
In my test on a 300MB TSV file, it was roughly 5x faster than the Python solution (2.5s vs. 14s).
Right-click the file, click Rename, delete the 't' and put in a 'c'. I'm actually not joking: most CSV parsers can handle tab delimiters. I just had this issue, and for my purposes renaming worked just fine.