I am a bioinformatician and recently got stuck on a problem that requires some scripting to speed up my process. We have a piece of software called PHASE, and the command I type in my console looks like ./PHASE test1.inp test1.out. I have about 1000 such input files (test1.inp through test1000.inp), and running them one after another would take roughly 3,000 hours. How can I run several instances at a time (say, four in parallel) from a bash script?
If you have GNU xargs, consider something like:
printf '%s\0' *.inp | xargs -0 -P 4 -n 1 \
    sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _
The -P 4 is the important part here: it sets the number of processes to run in parallel.
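If you would rather match the machine's core count than hard-code four, GNU coreutils' nproc can supply the number. A sketch of the same pipeline, assuming nproc is available:

# Sketch: size the pool to the number of available cores (nproc is GNU coreutils)
printf '%s\0' *.inp | xargs -0 -P "$(nproc)" -n 1 \
    sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _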
If you have a very large number of inputs and they're fast to process, consider replacing -n 1 with a larger number, to increase the number of inputs each shell instance iterates over -- decreasing shell startup costs, but also reducing granularity and, potentially, the level of parallelism.
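For instance, a variant where each shell instance handles up to 16 files (the batch size of 16 is an arbitrary choice for illustration) might look like:

# Sketch: each sh instance loops over up to 16 input files,
# with four shells alive at any one time
printf '%s\0' *.inp | xargs -0 -P 4 -n 16 \
    sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _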
That said, if you really want to do batches of four (per your question), letting all four finish before starting the next four (which introduces some inefficiency, but is what you asked for), you could do something like this...
set -- *.inp                     # set $@ to the list of files matching *.inp
while (( $# )); do               # until we exhaust that list...
    for ((i=0; i<4; i++)); do    # ...loop over batches of four...
        # as long as there's a next argument, start a process for it and take it off the list
        if [[ $1 ]]; then
            ./PHASE "$1" "${1%.inp}.out" &
            shift
        fi
    done
    wait    # ...and wait for the running processes to finish before proceeding
done
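As an aside, if your bash is 4.3 or newer, wait -n can remove that batch barrier: it returns as soon as any one background job finishes, so you can start a replacement immediately. A minimal sketch (the counter logic here is ours, not part of the answer above):

# Sketch, assuming bash 4.3+ for "wait -n": keep four jobs running at all times,
# starting a replacement as soon as any one finishes
running=0
for f in *.inp; do
    ./PHASE "$f" "${f%.inp}.out" &
    if (( ++running >= 4 )); then
        wait -n            # returns as soon as any one job exits
        (( running-- ))
    fi
done
wait                       # collect the stragglers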
My money is on GNU Parallel, rather than shell hackery! Nice term, @William-Pursell!
It looks like this:
parallel ./PHASE test{1}.inp test{1}.out ::: {1..1000}
It is as simple as that. If you want to run 16 jobs at a time, just add -j 16, like this:
parallel -j 16 ./PHASE ...
If you want to get a progress report, just add --progress, like this:
parallel --progress ./PHASE ...
If you want to add a bunch of extra servers around your network to speed things up, just add their names or IP addresses with -S, like this:
parallel -S meatyServer1 -S meatyServer2 ./PHASE ...
If you want a log of when processes were started and when they completed, just do this:
parallel --joblog $HOME/parallelLog.txt ./PHASE ...
If you want to add check-pointing so your jobs can be stopped and restarted, which you almost certainly should with 3,000 hours of processing, that is also easy. There are many variants, but for example, you could skip jobs whose corresponding output files already exist, so that if you restart, you immediately carry on where you left off. I would make a little bash function and do it like this:
#!/bin/bash
# Define a function for "GNU Parallel" to call
checkpointedPHASE() {
    ip="test${1}.inp"
    op="test${1}.out"
    # Skip the job if its output file already exists
    if [ -f "$op" ]; then
        echo "Skipping $1 ..."
    else
        ./PHASE "$ip" "$op"
    fi
}
export -f checkpointedPHASE
# Now start parallel jobs
parallel checkpointedPHASE {1} ::: {1..1000}
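Alternatively, GNU Parallel has check-pointing built in: combining --joblog with --resume makes it skip any job already recorded as completed in the log, so a restarted run carries on where it left off:

parallel --resume --joblog $HOME/parallelLog.txt ./PHASE test{1}.inp test{1}.out ::: {1..1000}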
You are in good company doing Bioinformatics with GNU Parallel - bioinformatics tutorial with GNU Parallel.
"multi-threading" is the wrong word for what you are trying to do. You want to run multiple processes in parallel. Multi-threading refers to having multiple threads of execution running in the same process. Running all of the processes at once and letting the os schedule them for you has been mentioned, as has xargs -P
, and you might want to look at gnu parallel
. You can also hack a solution in the shell, but this has several issues (namely, it is not even remotely robust). The basic idea is to create a pipe and have each process write a token into the pipe when it is done. At the same time, you read the pipe and start up a new process whenever a token appears. For example:
#!/bin/bash
n=${1-4}    # Use the first argument as the number of processes to run; default is 4

trap 'rm -vf /tmp/fifo' 0    # clean up the pipe on exit
rm -f /tmp/fifo
mkfifo /tmp/fifo

# Run one job, then write a token into the pipe to signal completion
cmd() {
    ./PHASE "test$1.inp" "test$1.out"
    echo "$1" > /tmp/fifo
}

# Spawn the first $n processes
yes | nl | sed ${n}q | while read num line; do
    cmd "$num" &
done

# Spawn a new process whenever a running process terminates
yes | nl | sed -e 1,${n}d -e 1000q | {
    while read num line; do
        read -u 5 stub    # wait for one to terminate
        cmd "$num" &
    done 5< /tmp/fifo
    wait
} &
exec 3> /tmp/fifo    # hold a write end open so the reader never sees EOF
wait
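As the caveat above suggests, the fixed /tmp/fifo path is one of the fragile parts: two simultaneous runs would trample each other's pipe. One small hardening (a sketch, assuming GNU mktemp; the fifo variable name is ours) is to generate a private path instead:

# Sketch: use a private, randomly named fifo rather than the fixed /tmp/fifo
fifo=$(mktemp -u /tmp/phase.fifo.XXXXXX)    # -u prints an unused name without creating the file
trap 'rm -f "$fifo"' 0
mkfifo "$fifo"
# ...then use "$fifo" everywhere /tmp/fifo appears above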
Bash does not support multi-threading; however, it does support multi-processing. If you change your command to this:
for i in {1..1000}; do
    ./PHASE test$i.inp test$i.out &
done
then each invocation will run as a separate process, and your computer will automatically schedule them based on how many cores you have. 1000 processes will have a lot of overhead compared to threads, but while not ideal it should still be fine.
Edit: Here is a more advanced method, if you want to prioritize getting answers progressively:
#!/bin/bash
# Number of cores and range end
n=4
e=1000

# Each worker processes a strided slice of the range:
# worker $1 handles inputs $1, $1+$2, $1+2*$2, ... up to $3
process() {
    for ((i=$1; i <= $3; i += $2)); do
        ./PHASE "test${i}.inp" "test${i}.out"
        echo "Done $i"
    done
}

# Create one background process per core
for ((i=1; i <= n; i++)); do
    process $i $n $e &
done

# Wait for every process to complete
wait