Shell Script for multithreading a process

失恋的感觉 2020-12-22 04:07

I am a bioinformatician and recently got stuck on a problem that requires some scripting to speed up my process. We have a piece of software called PHASE, and the command that I type in my c

4 Answers
  • 2020-12-22 05:00

    If you have GNU xargs, consider something like:

    printf '%s\0' *.inp | xargs -0 -P 4 -n 1 \
      sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _
    

    The -P 4 is important here, indicating the number of processes to run in parallel.

    If you have a very large number of inputs and they're fast to process, consider replacing -n 1 with a larger number, to increase the number of inputs each shell instance iterates over -- decreasing shell startup costs, but also reducing granularity and, potentially, level of parallelism.
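
    For example, a variant that hands each shell instance 16 files at a time (the 16 here is purely illustrative, not a recommendation) might look like:

    printf '%s\0' *.inp | xargs -0 -P 4 -n 16 \
      sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _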


    That said, if you really want to do batches of four (per your question), letting all four finish before starting the next four (which introduces some inefficiency, but is what you asked for), you could do something like this...

    set -- *.inp                # set $@ to list of files matching *.inp
    while (( $# )); do          # until we exhaust that list...
      for ((i=0; i<4; i++)); do # loop over batches of four...
        # as long as there's a next argument, start a process for it, and take it off the list
        [[ $1 ]] && { ./PHASE "$1" "${1%.inp}.out" & shift; }
      done
      wait                      # ...and wait for running processes to finish before proceeding
    done
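
    If you would rather not stall at each batch boundary, newer bash (4.3+) has wait -n, which returns as soon as any single background job exits, so a freed slot can be refilled immediately. A minimal sketch of that approach, assuming the same *.inp naming as above:

    running=0
    for f in *.inp; do
      if (( running >= 4 )); then
        wait -n              # returns as soon as any one background job exits
        (( running-- ))
      fi
      ./PHASE "$f" "${f%.inp}.out" &
      (( running++ ))
    done
    wait                     # wait for the remaining jobs to finish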
    
  • 2020-12-22 05:04

    My money is on GNU Parallel rather than shell hackery! Nice term, @William-Pursell!

    It looks like this:

    parallel ./PHASE test{1}.inp test{1}.out ::: {1..1000}
    

    It is:

    • easy to write
    • easy to read
    • performant
    • flexible

    If you want to run 16 jobs at a time, just add -j like this:

    parallel -j 16 ./PHASE ...
    

    If you want to get a progress report, just add --progress, like this:

    parallel --progress ./PHASE ...
    

    If you want to add a bunch of extra servers all around your network to speed things up, just add their IP addresses with -S, like this:

    parallel -S meatyServer1 -S meatyServer2 ./PHASE ...
    

    If you want a log of when processes were started and when they completed, just do this:

    parallel --joblog $HOME/parallelLog.txt ./PHASE ...
    

    If you want to add checkpointing so your jobs can be stopped and restarted, which you almost certainly should with 3,000 hours of processing, that is also easy. There are many variants, but, for example, you could skip jobs whose corresponding output files already exist, so that a restart immediately carries on where you left off. I would make a little bash function and do it like this:

    #!/bin/bash
    
    # Define a function for "GNU Parallel" to call
    checkpointedPHASE() {
        ip="test${1}.inp"
        op="test${1}.out"
        # Skip job if already done
        if [ -f "$op" ]; then
           echo "Skipping $1 ..."
        else
           ./PHASE "$ip" "$op"
        fi
    }
    export -f checkpointedPHASE
    
    # Now start parallel jobs
    parallel checkpointedPHASE {1} ::: {1..1000}
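
    For what it's worth, GNU Parallel also has checkpointing built in: --joblog records each job as it completes, and adding --resume makes a rerun skip every job already present in the log, with no wrapper function needed:

    parallel --joblog $HOME/parallelLog.txt --resume ./PHASE test{1}.inp test{1}.out ::: {1..1000}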
    

    You are in good company doing bioinformatics with GNU Parallel: see the bioinformatics tutorial with GNU Parallel.

  • 2020-12-22 05:08

    "multi-threading" is the wrong word for what you are trying to do. You want to run multiple processes in parallel. Multi-threading refers to having multiple threads of execution running in the same process. Running all of the processes at once and letting the os schedule them for you has been mentioned, as has xargs -P, and you might want to look at gnu parallel. You can also hack a solution in the shell, but this has several issues (namely, it is not even remotely robust). The basic idea is to create a pipe and have each process write a token into the pipe when it is done. At the same time, you read the pipe and start up a new process whenever a token appears. For example:

    #!/bin/bash

    n=${1-4}  # Use first arg as number of processes to run; default is 4

    # The FIFO is the channel finished jobs use to report back
    trap 'rm -vf /tmp/fifo' 0
    rm -f /tmp/fifo
    mkfifo /tmp/fifo

    cmd() {
        ./PHASE "test$1.inp" "test$1.out"
        echo "$1" > /tmp/fifo      # write a token when this job finishes
    }

    # Spawn the first $n processes (yes | nl | sed ${n}q emits numbered lines 1..n)
    yes | nl | sed ${n}q | while read num line; do
            cmd "$num" &
    done

    # Spawn a new process whenever a running process terminates
    # (this pipeline emits numbered lines n+1..1000, one per remaining job)
    yes | nl | sed -e 1,${n}d -e 1000q | {
    while read num line; do
            read -u 5 stub          # block until some job writes its token
            cmd "$num" &
    done 5< /tmp/fifo
    wait
    } &
    exec 3> /tmp/fifo   # hold the FIFO open for writing so the reader never sees EOF
    wait
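
    For comparison, if GNU Parallel is available, its sem wrapper (shorthand for parallel --semaphore) provides the same run-at-most-N-jobs behavior without any FIFO bookkeeping:

    for i in {1..1000}; do
        sem -j 4 ./PHASE test$i.inp test$i.out
    done
    sem --wait   # block until every queued job has finished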
    
  • 2020-12-22 05:13

    Bash does not support multi-threading, but it does support multi-processing. If you change your command to:

    for i in {1..1000}; do
        ./PHASE test$i.inp test$i.out &
    done
    

    This will run each command as a separate process, and your OS will automatically schedule them across however many cores you have. 1000 processes will have a lot of overhead compared to threads, but while not ideal, it should still be fine.
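
    If launching all 1000 processes at once is a concern, one simple way to cap concurrency at your core count (assuming GNU xargs and coreutils' nproc) is:

    seq 1 1000 | xargs -P "$(nproc)" -I{} ./PHASE test{}.inp test{}.out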

    Edit: Here is a more advanced method if you want results to come back progressively as the run proceeds:

    #!/bin/bash
    # Number of cores and range end
    n=4
    e=1000
    
    # Each worker handles a strided slice of the jobs: start, start+step, start+2*step, ...
    process() {
        for ((i=$1; i <= $3; i += $2)); do
            ./PHASE "test${i}.inp" "test${i}.out"
            echo "Done $i"
        done
    }
    
    # Start one worker process per core
    for ((i=1; i <= n; i++)); do
        process $i $n $e &
    done
    
    # Wait for each process to complete
    wait
    