Question
I want to split a genomic data file with 800,000 columns and 40,000 rows (118 GB in total) into a series of files with 100 columns each.
I am currently running 15 parallel instances of the following bash script:
#!/bin/bash
# Usage: <script> infile start_column end_column chunk_width
infile="$1"
start=$2
end=$3
step=$(($4-1))

# Cut out successive blocks of $4 columns, writing each block to its own file.
for ((curr=start; curr+step <= end; curr+=step+1)); do
    cut -d' ' -f$curr-$((curr+step)) "$infile" > "${infile}.$curr"
done
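For reference, the 15 parallel invocations look roughly like this (the script name and exact column ranges here are illustrative assumptions, not the real values):

./split_cols.sh genotypes.txt 1 53400 100 &
./split_cols.sh genotypes.txt 53401 106800 100 &
# ... 12 more workers covering the ranges in between ...
./split_cols.sh genotypes.txt 747601 800000 100 &
wait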
However, judging by the script's current progress, it will take about 300 days to complete the split?!
Is there a more efficient way to split a space-delimited file column-wise into smaller chunks?
Answer 1:
Try this awk script:
awk -v cols=100 '{
    f = 1
    for (i = 1; i <= NF; i++) {
        # Write field i to chunk file FILENAME.f; end the output line after
        # every cols-th field (and after the last field), otherwise emit OFS.
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
        # Move on to the next chunk file once cols fields have been written.
        f = int(i / cols) + 1
    }
}' largefile
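With the input named largefile as above, columns 1-100 of every row end up in largefile.1, columns 101-200 in largefile.2, and so on, up to largefile.8000 for 800,000 columns. A quick sanity check after the run (a sketch, file names assumed):

# Every chunk should have the same number of rows as the input (40,000 here).
wc -l largefile largefile.1 largefile.8000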
I expect it to be much faster than the shell script in the question: awk reads the 118 GB input only once and writes all the chunk files in a single pass, whereas the cut loop re-reads the entire file for every 100-column slice (roughly 800,000 / 100 = 8,000 full passes in total).
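One caveat worth checking (an addition of mine, not part of the original answer): the script keeps all ~8,000 chunk files open at once. gawk multiplexes file descriptors when it runs out, but other awk implementations may abort with a "too many open files" error. Below is a sketch of a variant that appends to and closes each chunk file after every row, under the same assumptions about the input name:

rm -f largefile.[0-9]*   # remove stale chunks first, since ">>" appends
awk -v cols=100 '{
    for (i = 1; i <= NF; i++) {
        out = FILENAME "." (int((i - 1) / cols) + 1)
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) >> out
        # Close the chunk as soon as this row is done with it, so at most
        # one output file is open at a time.
        if (i % cols == 0 || i == NF) close(out)
    }
}' largefile

The trade-off is many extra open/close calls per row, so with gawk the simpler one-pass version above should be preferred.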
Source: https://stackoverflow.com/questions/40997710/split-file-with-800-000-columns