A simple way to collect output from parallel `md5sum` subprocesses is to use a thread pool and write to the file from the main process:

```python
from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum(filename):
    try:
        return check_output(["md5sum", filename]), None
    except Exception as e:
        return None, e

if __name__ == "__main__":
    p = Pool(number_of_processes)  # specify number of concurrent processes
    with open("md5sums.txt", "wb") as logfile:
        for output, error in p.imap(md5sum, filenames):  # provide filenames
            if error is None:
                logfile.write(output)
```
- the output from `md5sum` is small, so you can store it in memory
- `imap` preserves the order of the input `filenames`
- `number_of_processes` may differ from the number of files or CPU cores (larger values don't necessarily mean faster: it depends on the relative performance of I/O (disks) and CPU); see the sketch below for a starting point
You can also try passing several files at once to each `md5sum` subprocess, which amortizes the cost of spawning a process.
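A minimal sketch of that idea, assuming the `filenames` list and the `Pool` from the first example (the batch size of 20 is an arbitrary choice):

```python
from subprocess import check_output

def md5sum_batch(batch):
    # md5sum prints one "<hash>  <name>" line per file passed to it
    try:
        return check_output(["md5sum"] + list(batch)), None
    except Exception as e:
        return None, e

batch_size = 20  # arbitrary; tune for your workload
batches = [filenames[i:i + batch_size]
           for i in range(0, len(filenames), batch_size)]
# then iterate over p.imap(md5sum_batch, batches) as before
```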
You don't need an external subprocess in this case at all; you can compute the MD5 directly in Python:
```python
import hashlib
from functools import partial

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        # read the file in fixed-size chunks so large files don't exhaust memory
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()
```
To use multiple processes instead of threads (to allow the pure Python `md5sum()` to run in parallel utilizing multiple CPUs), just drop `.dummy` from the `multiprocessing.dummy` import in the code above.
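Putting the two together, a minimal sketch of the process-based variant (assuming `filenames` and `number_of_processes` as before; error handling is omitted for brevity, and results are written as `md5sum`-style lines):

```python
import hashlib
from functools import partial
from multiprocessing import Pool  # real processes, not threads

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()

if __name__ == "__main__":  # required on platforms that spawn processes
    p = Pool(number_of_processes)
    with open("md5sums.txt", "w") as logfile:
        # imap preserves order, so results can be zipped back to filenames
        for digest, filename in zip(p.imap(md5sum, filenames), filenames):
            logfile.write("%s  %s\n" % (digest, filename))
```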