I am a bioinformatician and recently got stuck on a problem that requires some scripting to speed up my process. We have a piece of software called PHASE, and the command I type in my console looks like ./PHASE test1.inp test1.out. I have about 1000 such input files (test1.inp through test1000.inp), and running them one after another would take roughly 3,000 hours. How can I run several instances at a time (say, four in parallel) from a bash script?
If you have GNU xargs, consider something like:
printf '%s\0' *.inp | xargs -0 -P 4 -n 1 \
    sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _
The -P 4 is the important part here: it sets the number of processes to run in parallel.
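If you would rather match the machine's core count than hard-code four, GNU coreutils' nproc can supply the number. A sketch of the same pipeline, assuming nproc is available:

# Sketch: size the pool to the number of available cores (nproc is GNU coreutils)
printf '%s\0' *.inp | xargs -0 -P "$(nproc)" -n 1 \
    sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _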
If you have a very large number of inputs and they're fast to process, consider replacing -n 1 with a larger number, to increase the number of inputs each shell instance iterates over -- decreasing shell startup costs, but also reducing granularity and, potentially, the level of parallelism.
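For instance, a variant where each shell instance handles up to 16 files (the batch size of 16 is an arbitrary choice for illustration) might look like:

# Sketch: each sh instance loops over up to 16 input files,
# with four shells alive at any one time
printf '%s\0' *.inp | xargs -0 -P 4 -n 16 \
    sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _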
That said, if you really want to do batches of four (per your question), letting all four finish before starting the next four (which introduces some inefficiency, but is what you asked for), you could do something like this...
set -- *.inp                     # set $@ to the list of files matching *.inp
while (( $# )); do               # until we exhaust that list...
    for ((i=0; i<4; i++)); do    # ...loop over batches of four...
        # as long as there's a next argument, start a process for it and take it off the list
        if [[ $1 ]]; then
            ./PHASE "$1" "${1%.inp}.out" &
            shift
        fi
    done
    wait    # ...and wait for the running processes to finish before proceeding
done
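As an aside, if your bash is 4.3 or newer, wait -n can remove that batch barrier: it returns as soon as any one background job finishes, so you can start a replacement immediately. A minimal sketch (the counter logic here is ours, not part of the answer above):

# Sketch, assuming bash 4.3+ for "wait -n": keep four jobs running at all times,
# starting a replacement as soon as any one finishes
running=0
for f in *.inp; do
    ./PHASE "$f" "${f%.inp}.out" &
    if (( ++running >= 4 )); then
        wait -n            # returns as soon as any one job exits
        (( running-- ))
    fi
done
wait                       # collect the stragglers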
My money is on GNU Parallel, rather than shell hackery! Nice term, @William-Pursell!
It looks like this:
parallel ./PHASE test{1}.inp test{1}.out ::: {1..1000}
It is as simple as that. If you want to run 16 jobs at a time, just add -j 16, like this:
parallel -j 16 ./PHASE ...
If you want to get a progress report, just add --progress, like this:
parallel --progress ./PHASE ...
If you want to add a bunch of extra servers around your network to speed things up, just add their names or IP addresses with -S, like this:
parallel -S meatyServer1 -S meatyServer2 ./PHASE ...
If you want a log of when processes were started and when they completed, just do this:
parallel --joblog $HOME/parallelLog.txt ./PHASE ...
If you want to add check-pointing so your jobs can be stopped and restarted, which you almost certainly should with 3,000 hours of processing, that is also easy. There are many variants, but for example, you could skip jobs whose corresponding output files already exist, so that if you restart, you immediately carry on where you left off. I would make a little bash function and do it like this:
#!/bin/bash
# Define a function for "GNU Parallel" to call
checkpointedPHASE() {
    ip="test${1}.inp"
    op="test${1}.out"
    # Skip the job if its output file already exists
    if [ -f "$op" ]; then
        echo "Skipping $1 ..."
    else
        ./PHASE "$ip" "$op"
    fi
}
export -f checkpointedPHASE
# Now start parallel jobs
parallel checkpointedPHASE {1} ::: {1..1000}
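Alternatively, GNU Parallel has check-pointing built in: combining --joblog with --resume makes it skip any job already recorded as completed in the log, so a restarted run carries on where it left off:

parallel --resume --joblog $HOME/parallelLog.txt ./PHASE test{1}.inp test{1}.out ::: {1..1000}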
You are in good company doing Bioinformatics with GNU Parallel - bioinformatics tutorial with GNU Parallel.
"multi-threading" is the wrong word for what you are trying to do. You want to run multiple processes in parallel. Multi-threading refers to having multiple threads of execution running in the same process. Running all of the processes at once and letting the os schedule them for you has been mentioned, as has xargs -P
, and you might want to look at gnu parallel
. You can also hack a solution in the shell, but this has several issues (namely, it is not even remotely robust). The basic idea is to create a pipe and have each process write a token into the pipe when it is done. At the same time, you read the pipe and start up a new process whenever a token appears. For example:
#!/bin/bash
n=${1-4}    # Use the first argument as the number of processes to run; default is 4

trap 'rm -vf /tmp/fifo' 0    # clean up the pipe on exit
rm -f /tmp/fifo
mkfifo /tmp/fifo

# Run one job, then write a token into the pipe to signal completion
cmd() {
    ./PHASE "test$1.inp" "test$1.out"
    echo "$1" > /tmp/fifo
}

# Spawn the first $n processes
yes | nl | sed ${n}q | while read num line; do
    cmd "$num" &
done

# Spawn a new process whenever a running process terminates
yes | nl | sed -e 1,${n}d -e 1000q | {
    while read num line; do
        read -u 5 stub    # wait for one to terminate
        cmd "$num" &
    done 5< /tmp/fifo
    wait
} &
exec 3> /tmp/fifo    # hold a write end open so the reader never sees EOF
wait
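As the caveat above suggests, the fixed /tmp/fifo path is one of the fragile parts: two simultaneous runs would trample each other's pipe. One small hardening (a sketch, assuming GNU mktemp; the fifo variable name is ours) is to generate a private path instead:

# Sketch: use a private, randomly named fifo rather than the fixed /tmp/fifo
fifo=$(mktemp -u /tmp/phase.fifo.XXXXXX)    # -u prints an unused name without creating the file
trap 'rm -f "$fifo"' 0
mkfifo "$fifo"
# ...then use "$fifo" everywhere /tmp/fifo appears above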
Bash does not support multi-threading; however, it does support multi-processing. If you change your command to this:
for i in {1..1000}; do
    ./PHASE test$i.inp test$i.out &
done
then each invocation will run as a separate process, and your computer will automatically schedule them based on how many cores you have. 1000 processes will have a lot of overhead compared to threads, but while not ideal it should still be fine.
Edit: Here is a more advanced method, if you want to prioritize getting answers progressively:
#!/bin/bash
# Number of cores and range end
n=4
e=1000

# Each worker processes a strided slice of the range:
# worker $1 handles inputs $1, $1+$2, $1+2*$2, ... up to $3
process() {
    for ((i=$1; i <= $3; i += $2)); do
        ./PHASE "test${i}.inp" "test${i}.out"
        echo "Done $i"
    done
}

# Create one background process per core
for ((i=1; i <= n; i++)); do
    process $i $n $e &
done

# Wait for every process to complete
wait