Question
I have a job that uses the multiprocessing package and calls a function via resultList = pool.map(myFunction, myListOfInputParameters).
Each entry in the list of input parameters is independent of the others.
This job will run for a couple of hours. For safety reasons, I would like to store the intermediate results at regular time intervals, e.g. once an hour.
How can I do this so that, if the job is aborted, I can restart it from the last available backup and continue processing?
Answer 1:
There are at least two possible options.
- Have each call of myFunction save its output into a uniquely named file. The file name should be based on or linked to the input data. Use the parent program to gather the results. In this case myFunction should return an identifier of the item that is finished.
- Use imap_unordered instead of map. This will start yielding results as soon as they are available, instead of returning when all processing is finished. Have the parent program save the returned data and an indication of which items are finished (a sketch follows this list).
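For option 2, a minimal sketch might look like the following. It assumes myFunction takes a single parameter and returns a (parameter, result) pair, and that the accumulated results fit in memory; the checkpoint file name and the one-hour interval are only illustrative, not part of the original answer.

import pickle
import time
from multiprocessing import Pool

CHECKPOINT = 'checkpoint.pkl'   # hypothetical file name
SAVE_INTERVAL = 3600            # roughly once an hour, in seconds

def myFunction(param):
    # Placeholder for the real, long-running work; returns (input, result).
    return param, param * 2

if __name__ == '__main__':
    myListOfInputParameters = list(range(100))
    results = {}
    last_save = time.time()
    with Pool() as pool:
        # imap_unordered yields each result as soon as a worker finishes it.
        for param, result in pool.imap_unordered(myFunction, myListOfInputParameters):
            results[param] = result
            if time.time() - last_save > SAVE_INTERVAL:
                with open(CHECKPOINT, 'wb') as f:
                    pickle.dump(results, f)
                last_save = time.time()
    # Save once more when everything is done.
    with open(CHECKPOINT, 'wb') as f:
        pickle.dump(results, f)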
In both cases, the program would have to examine the data saved from previous runs to adjust myListOfInputParameters when it is restarted (a small restart snippet follows).
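As a rough illustration of that restart step, assuming the previous run saved a dict keyed by input parameter as in the checkpoint sketch above (file name again made up):

import os
import pickle

myListOfInputParameters = list(range(100))   # stand-in for the real input list

# Load whatever a previous run managed to save.
done = {}
if os.path.exists('checkpoint.pkl'):
    with open('checkpoint.pkl', 'rb') as f:
        done = pickle.load(f)

# Hand the pool only the parameters that do not have a saved result yet.
remaining = [p for p in myListOfInputParameters if p not in done]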
Which option is best depends to a large degree on the amount of data returned by myFunction. If this is a large amount, there is significant overhead associated with transferring it back to the parent process; in that case option 1 is probably best. Since writing to disk is relatively slow, calculations will probably go faster with option 2, and it is easier for the parent program to track progress.
Note that you can also use imap_unordered with option 1.
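A sketch of that combination, option 1 together with imap_unordered, could look like this; the per-item file naming scheme and the finished-items log are assumptions made for illustration, not anything prescribed by the answer.

import pickle
from multiprocessing import Pool

def myFunction(param):
    result = param * 2   # placeholder for the real work
    # Option 1: each worker writes its own result to a uniquely named file.
    with open('result_{}.pkl'.format(param), 'wb') as f:
        pickle.dump(result, f)
    return param         # only an identifier travels back to the parent

if __name__ == '__main__':
    myListOfInputParameters = list(range(100))
    with Pool() as pool, open('finished.txt', 'a') as log:
        # imap_unordered lets the parent record each identifier as soon as it is done.
        for finished_param in pool.imap_unordered(myFunction, myListOfInputParameters):
            log.write('{}\n'.format(finished_param))
            log.flush()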
Answer 2:
Perhaps use pickle. Read more here:
https://docs.python.org/3/library/pickle.html
Based on aws_apprentice's comment I created a full multiprocessing example in case you weren't sure how to use intermediate results. The first time this is run it will print "None" as there are no intermediate results. Run it again to simulate restarting.
from multiprocessing import Process
import pickle

def proc(name):
    data = None
    # Load intermediate results if they exist
    try:
        f = open(name+'.pkl', 'rb')
        data = pickle.load(f)
        f.close()
    except:
        pass
    # Do something
    print(data)
    data = "intermediate result for " + name
    # Periodically save your intermediate results
    f = open(name+'.pkl', 'wb')
    pickle.dump(data, f, -1)
    f.close()

processes = []
for x in range(5):
    p = Process(target=proc, args=("proc"+str(x),))
    p.daemon = True
    p.start()
    processes.append(p)

for process in processes:
    process.join()

for process in processes:
    process.terminate()
You can also use json if it makes sense to output the intermediate results in a human-readable format, or sqlite as a database if you need to push the data into rows.
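If you go the sqlite route, a minimal sketch could be something like the following, assuming one row per input parameter; the database, table, and column names are made up for illustration.

import sqlite3

# Open (or create) the results database; table and column names are illustrative.
conn = sqlite3.connect('results.db')
conn.execute('CREATE TABLE IF NOT EXISTS results (param TEXT PRIMARY KEY, value TEXT)')

def save_result(param, value):
    # INSERT OR REPLACE makes re-running an item after a restart harmless.
    conn.execute('INSERT OR REPLACE INTO results (param, value) VALUES (?, ?)',
                 (param, value))
    conn.commit()

def already_done():
    # Parameters that already have a stored result can be skipped on restart.
    return {row[0] for row in conn.execute('SELECT param FROM results')}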
Source: https://stackoverflow.com/questions/53996035/dump-intermediate-results-of-multiprocessing-job-to-filesystem-and-continue-with