Question
I wrote a shell script that splits a file into 4 parts automatically using csplit, then four shell scripts that run the same command in the background using nohup, plus a while loop that waits for those four processes to finish, and finally
cat output1.txt ... output4.txt > finaloutput.txt
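A simplified sketch of that approach (file names and split points are illustrative, not my actual script):
csplit -f part data.txt1 2500001 5000001 7500001        # -> part00 .. part03
for i in 1 2 3 4; do
    nohup sh -c "wc -l < part0$((i-1)) > output$i.txt" &  # same command on each part, in background
done
wait                                                     # wait for the four background jobs
cat output1.txt output2.txt output3.txt output4.txt > finaloutput.txt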
But then I came to know about the parallel command, and I tried it on a big file, but it does not seem to work as expected. The file is the output of the command below:
for i in $(seq 1 1000000);do cat /etc/passwd >> data.txt1;done
time wc -l data.txt1
10000000 data.txt1
real 0m0.507s
user 0m0.080s
sys 0m0.424s
With parallel:
time cat data.txt1 | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
10000000
real 0m41.984s
user 0m1.122s
sys 0m36.251s
And when I tried this on a 2 GB file (~10 million records), it took more than 20 minutes.
Does this command only work on multi-core systems? (I am currently using a single-core system.)
nproc --all
1
Answer 1:
--pipe is inefficient (though not at the scale you are measuring - something is very wrong on your system). It can deliver in the order of 1 GB/s (total).
--pipepart is, on the contrary, highly efficient. It can deliver in the order of 1 GB/s per core, provided your disk is fast enough. This should be the most efficient way of processing data.txt1. It will split data.txt1 into one block per CPU core and feed those blocks into a wc -l running on each core:
parallel --block -1 --pipepart -a data.txt1 wc -l
You need version 20161222 or later for --block -1 to work.
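To check which version you have installed, you can run:
parallel --version | head -1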
These are timings from my old dual-core laptop. seq 200000000 generates 1.8 GB of data.
$ time seq 200000000 | LANG=C wc -c
1888888898
real 0m7.072s
user 0m3.612s
sys 0m2.444s
$ time seq 200000000 | parallel --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 1m28.101s
user 0m25.892s
sys 0m40.672s
The time here is mostly due to GNU Parallel spawning a new wc -c
for each 1 MB block. Increasing the block size makes it faster:
$ time seq 200000000 | parallel --block 10m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m26.269s
user 0m8.988s
sys 0m11.920s
$ time seq 200000000 | parallel --block 30m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m21.628s
user 0m7.636s
sys 0m9.516s
As mentioned, --pipepart is much faster if you have the data in a file:
$ seq 200000000 > data.txt1
$ time parallel --block -1 --pipepart -a data.txt1 LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m2.242s
user 0m0.424s
sys 0m2.880s
So on my old laptop I can process 1.8 GB in 2.2 seconds.
If you have only one core and your work is CPU-bound, then parallelizing will not help you. Parallelizing on a single-core machine can make sense if most of the time is spent waiting (e.g. waiting for the network).
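For example (illustrative URLs only, not part of the original question), downloading several files at once overlaps the network waits even on one core:
parallel -j 8 wget -q {} ::: http://example.com/a http://example.com/b http://example.com/c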
However, the timings from your computer tell me something is very wrong with it. I would recommend you test your program on another computer.
Answer 2:
In short, yes. You will need more physical cores on the machine to benefit from parallel. Just to make sure I understand your task, the following is what you intend to do:
file1 is a 10,000,000 line file
split into 4 files >
file1.1 > processing > output1
file1.2 > processing > output2
file1.3 > processing > output3
file1.4 > processing > output4
>> cat output* > output
________________________________
And you want to parallelize the middle part and run it on 4 cores (hopefully 4 cores) simultaneously. Am I correct? I think you can use GNU parallel in a much better way: write the code for one of the file segments and use that command with (pseudocode warning):
parallel --jobs 4 "processing code on the file segments with sequence variable {}" ::: 1 2 3 4
Here --jobs (-j) sets the number of jobs to run simultaneously.
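As an illustrative instance of that pseudocode (assuming csplit produced file1.1 .. file1.4 and the processing is simply counting lines):
parallel --jobs 4 'wc -l file1.{} > output{}.txt' ::: 1 2 3 4
cat output1.txt output2.txt output3.txt output4.txt > finaloutput.txt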
UPDATE: Why are you using the parallel command for sequential execution within your file1.1, file1.2, file1.3 and file1.4? Let that part stay as the regular sequential processing you have already coded:
parallel 'for i in $(seq 1 250000);do cat file1.{} >> output{}.txt;done' ::: 1 2 3 4
The above code will run your 4 segmented files from csplit in parallel (on 4 cores, if available), i.e. the equivalent of running these four loops at the same time:
for i in $(seq 1 250000);do cat file1.1 >> output1.txt;done
for i in $(seq 1 250000);do cat file1.2 >> output2.txt;done
for i in $(seq 1 250000);do cat file1.3 >> output3.txt;done
for i in $(seq 1 250000);do cat file1.4 >> output4.txt;done
I am pretty sure that --pipepart, as suggested above by Ole, is the better way to do it, given that you have high-speed data access from your disk.
Source: https://stackoverflow.com/questions/41921522/parallel-execution-of-unix-command