I have a tab-delimited file that has over 200 million lines. What's the fastest way in Linux to convert this to a CSV file? This file does have multiple lines of header information.
@ignacio-vazquez-abrams's Python solution is great! For people who are looking to parse delimiters other than tab, the csv library actually lets you set an arbitrary delimiter. Here is my modified version to handle pipe-delimited files:
import sys
import csv

# read pipe-delimited rows from stdin and write standard CSV to stdout
pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in pipein:
    commaout.writerow(row)
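A minimal way to run it, assuming the script above is saved as pipe2csv.py (a name I made up):

$ python pipe2csv.py < data.psv > data.csv

For the original tab-delimited question, passing delimiter='\t' to csv.reader is the only change needed.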
Assuming you don't want to change the header, and assuming you don't have embedded tabs:
$ cat file
header header header
one two three
$ awk 'NR>1{$1=$1}1' OFS="," file
header header header
one,two,three
NR>1 skips the first line, i.e. the header. You mentioned you know how many header lines there are, so use the correct number for your own case. With this you also do not need to call any other external commands; one awk command does the job.
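For example, if your file had three header lines instead of one, and your fields may contain spaces (so you want to split on tabs only), a sketch of the same idea; the header count and the explicit -F'\t' are the parts to adapt:

$ awk -F'\t' 'NR>3{$1=$1}1' OFS="," file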
Another way, if you have blank columns and you care about preserving them:
awk 'NR>1{gsub("\t",",")}1' file
Using sed:
$ sed '2,$y/\t/,/' file    # skip the 1-line header and translate tabs (same as tr)
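The same idea extends to multiple header lines; a sketch assuming three header lines, so translation starts at line 4:

$ sed '4,$y/\t/,/' file

(Note that \t inside the y command is a GNU sed extension.)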
I think it is better not to cat the file, because it may create problems in the case of a large file. The better way may be

$ tr '\t' ',' < tabdelimitedFile.txt > csvfile.csv

The command will get its input from tabdelimitedFile.txt and store the result, comma-separated, in csvfile.csv.
If you want to convert the whole tsv file into a csv file:
$ cat data.tsv | tr "\\t" "," > data.csv
If you want to omit some fields:
$ cat data.tsv | cut -f1,2,3 | tr "\\t" "," > data.csv
The above command converts the data.tsv file to a data.csv file containing only the first three fields.
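If you would rather leave the header untouched and convert only the data lines, one way is to split the file with head and tail (a sketch assuming a single header line; adjust both line numbers if you have more):

$ { head -n 1 data.tsv; tail -n +2 data.tsv | tr '\t' ','; } > data.csv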
The following awk one-liner supports quoting and quote-escaping (embedded quotes are doubled, as CSV expects):

$ printf "flop\tflap\"" | awk -F '\t' '{ for(i = 1; i <= NF; i++) { gsub(/"/,"\"\"",$i); printf "\"%s\"",$i; if(i < NF) printf "," }; printf "\n" }'
gives
"flop","flap""""