A simple way to collect output from parallel `md5sum` subprocesses is to use a thread pool and write to the file from the main process:

```python
from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum(filename):
    try:
        return check_output(["md5sum", filename]), None
    except Exception as e:
        return None, e

if __name__ == "__main__":
    p = Pool(number_of_processes)  # specify number of concurrent processes
    with open("md5sums.txt", "wb") as logfile:
        for output, error in p.imap(md5sum, filenames):  # provide filenames
            if error is None:
                logfile.write(output)
```
- the output from `md5sum` is small, so you can store it in memory
- `imap` preserves the order of the input `filenames`
- `number_of_processes` may differ from the number of files or CPU cores (larger values don't necessarily mean faster: it depends on the relative performance of I/O (disks) and CPU); see the sketch below for a starting point
You can also try passing several files at once to each `md5sum` subprocess, which amortizes the cost of spawning a process.
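A minimal sketch of that idea, assuming the `filenames` list and the `Pool` from the first example (the batch size of 20 is an arbitrary choice):

```python
from subprocess import check_output

def md5sum_batch(batch):
    # md5sum prints one "<hash>  <name>" line per file passed to it
    try:
        return check_output(["md5sum"] + list(batch)), None
    except Exception as e:
        return None, e

batch_size = 20  # arbitrary; tune for your workload
batches = [filenames[i:i + batch_size]
           for i in range(0, len(filenames), batch_size)]
# then iterate over p.imap(md5sum_batch, batches) as before
```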
You don't need an external subprocess in this case at all; you can compute the MD5 directly in Python:
```python
import hashlib
from functools import partial

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        # read the file in fixed-size chunks so large files don't exhaust memory
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()
```
To use multiple processes instead of threads (to allow the pure Python `md5sum()` to run in parallel utilizing multiple CPUs), just drop `.dummy` from the `multiprocessing.dummy` import in the code above.
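Putting the two together, a minimal sketch of the process-based variant (assuming `filenames` and `number_of_processes` as before; error handling is omitted for brevity, and results are written as `md5sum`-style lines):

```python
import hashlib
from functools import partial
from multiprocessing import Pool  # real processes, not threads

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()

if __name__ == "__main__":  # required on platforms that spawn processes
    p = Pool(number_of_processes)
    with open("md5sums.txt", "w") as logfile:
        # imap preserves order, so results can be zipped back to filenames
        for digest, filename in zip(p.imap(md5sum, filenames), filenames):
            logfile.write("%s  %s\n" % (digest, filename))
```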