Question
I want to split a genomic data file with 800,000 columns and 40,000 rows (118 GB in total) into a series of files with 100 columns each.
I am currently running 15 parallel instances of the following bash script:
#!/bin/bash
# Usage: <script> infile start_column end_column chunk_width
infile="$1"
start=$2
end=$3
step=$(($4-1))

# Cut out successive blocks of $4 columns, writing each block to its own file.
for ((curr=start; curr+step <= end; curr+=step+1)); do
    cut -d' ' -f$curr-$((curr+step)) "$infile" > "${infile}.$curr"
done
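For reference, the 15 parallel invocations look roughly like this (the script name and exact column ranges here are illustrative assumptions, not the real values):

./split_cols.sh genotypes.txt 1 53400 100 &
./split_cols.sh genotypes.txt 53401 106800 100 &
# ... 12 more workers covering the ranges in between ...
./split_cols.sh genotypes.txt 747601 800000 100 &
wait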
However, judging by the script's current progress, it will take about 300 days to complete the split?!
Is there a more efficient way to split a space-delimited file column-wise into smaller chunks?
Answer 1:
Try this awk script:
awk -v cols=100 '{
    f = 1
    for (i = 1; i <= NF; i++) {
        # Write field i to chunk file FILENAME.f; end the output line after
        # every cols-th field (and after the last field), otherwise emit OFS.
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
        # Move on to the next chunk file once cols fields have been written.
        f = int(i / cols) + 1
    }
}' largefile
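With the input named largefile as above, columns 1-100 of every row end up in largefile.1, columns 101-200 in largefile.2, and so on, up to largefile.8000 for 800,000 columns. A quick sanity check after the run (a sketch, file names assumed):

# Every chunk should have the same number of rows as the input (40,000 here).
wc -l largefile largefile.1 largefile.8000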
I expect it to be much faster than the shell script in the question: awk reads the 118 GB input only once and writes all the chunk files in a single pass, whereas the cut loop re-reads the entire file for every 100-column slice (roughly 800,000 / 100 = 8,000 full passes in total).
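One caveat worth checking (an addition of mine, not part of the original answer): the script keeps all ~8,000 chunk files open at once. gawk multiplexes file descriptors when it runs out, but other awk implementations may abort with a "too many open files" error. Below is a sketch of a variant that appends to and closes each chunk file after every row, under the same assumptions about the input name:

rm -f largefile.[0-9]*   # remove stale chunks first, since ">>" appends
awk -v cols=100 '{
    for (i = 1; i <= NF; i++) {
        out = FILENAME "." (int((i - 1) / cols) + 1)
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) >> out
        # Close the chunk as soon as this row is done with it, so at most
        # one output file is open at a time.
        if (i % cols == 0 || i == NF) close(out)
    }
}' largefile

The trade-off is many extra open/close calls per row, so with gawk the simpler one-pass version above should be preferred.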
Source: https://stackoverflow.com/questions/40997710/split-file-with-800-000-columns