Question
I have a multiprocessing script with pool.map that works. The problem is that not all tasks take the same time to finish, so some worker processes sit idle because they wait until all the others are finished (same problem as in this question). Some files are done in less than a second, others take minutes (or hours).

If I understand the manual (and this post) correctly, pool.imap does not wait for all tasks to finish: as soon as one worker is done, it hands it a new file to process. When I try that, the script races through the files to process; the small ones are handled as expected, but the large files (which take more time to process) are still unfinished when the script ends (are they killed without notice?). Is this normal behavior for pool.imap, or do I need to add more commands/parameters? When I add time.sleep(100) in the else part as a test, more of the large files get processed, but then the other worker processes sit idle. Any suggestions? Thanks
import os
import time
from multiprocessing import Pool

def process_file(infile):
    #read infile
    #compare things in infile
    #acquire Lock, save things in outfile, release Lock
    #delete infile
    pass

def main():
    nprocesses = 8
    global filename
    pathlist = ['tmp0', 'tmp1', 'tmp2', 'tmp3', 'tmp4', 'tmp5', 'tmp6', 'tmp7', 'tmp8', 'tmp9']
    for d in pathlist:
        os.chdir(d)
        todolist = []
        for infile in os.listdir():
            todolist.append(infile)
        try:
            p = Pool(processes=nprocesses)
            p.imap(process_file, todolist)
        except KeyboardInterrupt:
            print("Shutting processes down")
            # Optionally try to gracefully shut down the worker processes here.
            p.close()
            p.terminate()
            p.join()
        except StopIteration:
            continue
        else:
            time.sleep(100)
        os.chdir('..')
        p.close()
        p.join()

if __name__ == '__main__':
    main()
Answer 1:
Since you already collect all your files into a list, you could put them directly into a queue instead. The queue is then shared with your sub-processes, which take file names from it and do their work. That way the distribution is not done twice (first building the list, then having Pool.imap pickle and hand out that list). Pool.imap does essentially the same thing internally, just without you seeing it (see the aside after the next snippets on why the imap attempt raced ahead).
todolist = []
for infile in os.listdir():
    todolist.append(infile)

can be replaced by:

todolist = Queue()
for infile in os.listdir():
    todolist.put(infile)
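
As an aside, the likely reason the imap version in the question races through the files is that Pool.imap returns a lazy iterator right away; if the script moves on (changes directory, sleeps, or exits) without iterating it or properly joining the pool, the long-running tasks can be cut off when the pool is torn down. A minimal sketch of keeping the Pool approach but consuming the iterator (the worker body and file list here are placeholders, not the original code):

import os
from multiprocessing import Pool

def process_file(infile):
    # placeholder for the real work
    return infile

def main():
    todolist = os.listdir()  # hypothetical stand-in for the real file list
    p = Pool(processes=8)
    # imap returns a lazy iterator immediately; iterating it blocks
    # until each result is ready, so main() cannot run ahead of the
    # workers while long tasks are still in flight.
    for _ in p.imap(process_file, todolist):
        pass
    p.close()
    p.join()

if __name__ == '__main__':
    main()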
The complete queue-based solution would then look like:
import os
from multiprocessing import Process, Queue

def process_file(inqueue):
    # Do work until inqueue.get() returns the "STOP" sentinel.
    for infile in iter(inqueue.get, "STOP"):
        #read infile
        #compare things in infile
        #acquire Lock, save things in outfile, release Lock
        #delete infile
        pass

def main():
    nprocesses = 8
    pathlist = ['tmp0', 'tmp1', 'tmp2', 'tmp3', 'tmp4', 'tmp5', 'tmp6', 'tmp7', 'tmp8', 'tmp9']
    for d in pathlist:
        os.chdir(d)
        todolist = Queue()
        for infile in os.listdir():
            todolist.put(infile)
        processes = [Process(target=process_file, args=(todolist,))
                     for x in range(nprocesses)]
        # Tell the workers to stop once all files are handled:
        # one "STOP" sentinel per process, at the very end of the queue.
        for p in processes:
            todolist.put("STOP")
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        os.chdir('..')  # return to the parent directory before the next iteration

if __name__ == '__main__':
    main()
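
Two notes on this pattern. First, iter(callable, sentinel) is the two-argument form of iter(): it calls callable() repeatedly and stops as soon as the return value equals the sentinel, so each worker consumes exactly one "STOP" before exiting; that is why one sentinel per process goes into the queue. A single-process illustration (queue.Queue is used here only so the snippet runs without spawning workers):

from queue import Queue  # single-process stand-in for multiprocessing.Queue

q = Queue()
for item in ["a.txt", "b.txt", "STOP"]:
    q.put(item)

# Two-argument iter() keeps calling q.get() until it returns "STOP".
for item in iter(q.get, "STOP"):
    print(item)  # prints a.txt, then b.txt, then the loop ends

Second, the placeholder comments mention a Lock for saving to the shared outfile, but neither version constructs one. A sketch of how it could be wired in (the lock handling is an assumption, since that part of the code is elided in the question):

from multiprocessing import Lock, Process, Queue

def process_file(inqueue, lock):
    for infile in iter(inqueue.get, "STOP"):
        with lock:
            # hypothetical outfile write; the shared lock serializes
            # saves so two workers never write at the same time
            print("would process and save", infile)

def main():
    lock = Lock()                # created once in the parent
    inqueue = Queue()
    inqueue.put("example.txt")   # hypothetical work item
    workers = [Process(target=process_file, args=(inqueue, lock))
               for _ in range(2)]
    for w in workers:
        inqueue.put("STOP")      # one sentinel per worker
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == '__main__':
    main()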
Source: https://stackoverflow.com/questions/40795094/python-multiprocessing-pool-map-and-imap