Question
I have a job that uses the multiprocessing package and calls a function via resultList = pool.map(myFunction, myListOfInputParameters).
Each entry in the list of input parameters is independent of the others.
This job will run for a couple of hours. For safety reasons, I would like to store the intermediate results at regular time intervals, e.g. once an hour.
How can I do this so that, if the job is aborted, I can restart it from the last available backup and continue processing?
Answer 1:
There are at least two possible options.
- Have each call of myFunction save its output into a uniquely named file. The file name should be based on or linked to the input data. Use the parent program to gather the results. In this case myFunction should return an identifier of the item that is finished.
- Use imap_unordered instead of map. This will start yielding results as soon as they are available, instead of returning when all processing is finished. Have the parent program save the returned data and an indication of which items are finished (a sketch follows this list).
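For option 2, a minimal sketch might look like the following. It assumes myFunction takes a single parameter and returns a (parameter, result) pair, and that the accumulated results fit in memory; the checkpoint file name and the one-hour interval are only illustrative, not part of the original answer.

import pickle
import time
from multiprocessing import Pool

CHECKPOINT = 'checkpoint.pkl'   # hypothetical file name
SAVE_INTERVAL = 3600            # roughly once an hour, in seconds

def myFunction(param):
    # Placeholder for the real, long-running work; returns (input, result).
    return param, param * 2

if __name__ == '__main__':
    myListOfInputParameters = list(range(100))
    results = {}
    last_save = time.time()
    with Pool() as pool:
        # imap_unordered yields each result as soon as a worker finishes it.
        for param, result in pool.imap_unordered(myFunction, myListOfInputParameters):
            results[param] = result
            if time.time() - last_save > SAVE_INTERVAL:
                with open(CHECKPOINT, 'wb') as f:
                    pickle.dump(results, f)
                last_save = time.time()
    # Save once more when everything is done.
    with open(CHECKPOINT, 'wb') as f:
        pickle.dump(results, f)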
In both cases, the program would have to examine the data saved from previous runs to adjust myListOfInputParameters when it is restarted (a small restart snippet follows).
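As a rough illustration of that restart step, assuming the previous run saved a dict keyed by input parameter as in the checkpoint sketch above (file name again made up):

import os
import pickle

myListOfInputParameters = list(range(100))   # stand-in for the real input list

# Load whatever a previous run managed to save.
done = {}
if os.path.exists('checkpoint.pkl'):
    with open('checkpoint.pkl', 'rb') as f:
        done = pickle.load(f)

# Hand the pool only the parameters that do not have a saved result yet.
remaining = [p for p in myListOfInputParameters if p not in done]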
Which option is best depends to a large degree on the amount of data returned by myFunction. If this is a large amount, there is significant overhead associated with transferring it back to the parent process; in that case option 1 is probably best. Since writing to disk is relatively slow, calculations will probably go faster with option 2, and it is easier for the parent program to track progress.
Note that you can also use imap_unordered with option 1.
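A sketch of that combination, option 1 together with imap_unordered, could look like this; the per-item file naming scheme and the finished-items log are assumptions made for illustration, not anything prescribed by the answer.

import pickle
from multiprocessing import Pool

def myFunction(param):
    result = param * 2   # placeholder for the real work
    # Option 1: each worker writes its own result to a uniquely named file.
    with open('result_{}.pkl'.format(param), 'wb') as f:
        pickle.dump(result, f)
    return param         # only an identifier travels back to the parent

if __name__ == '__main__':
    myListOfInputParameters = list(range(100))
    with Pool() as pool, open('finished.txt', 'a') as log:
        # imap_unordered lets the parent record each identifier as soon as it is done.
        for finished_param in pool.imap_unordered(myFunction, myListOfInputParameters):
            log.write('{}\n'.format(finished_param))
            log.flush()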
Answer 2:
Perhaps use pickle. Read more here:
https://docs.python.org/3/library/pickle.html
Based on aws_apprentice's comment I created a full multiprocessing example in case you weren't sure how to use intermediate results. The first time this is run it will print "None" as there are no intermediate results. Run it again to simulate restarting.
from multiprocessing import Process
import pickle

def proc(name):
    data = None
    # Load intermediate results if they exist
    try:
        f = open(name+'.pkl', 'rb')
        data = pickle.load(f)
        f.close()
    except:
        pass
    # Do something
    print(data)
    data = "intermediate result for " + name
    # Periodically save your intermediate results
    f = open(name+'.pkl', 'wb')
    pickle.dump(data, f, -1)
    f.close()

processes = []
for x in range(5):
    p = Process(target=proc, args=("proc"+str(x),))
    p.daemon = True
    p.start()
    processes.append(p)

for process in processes:
    process.join()

for process in processes:
    process.terminate()
You can also use json if it makes sense to output the intermediate results in a human-readable format, or sqlite as a database if you need to push the data into rows.
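If you go the sqlite route, a minimal sketch could be something like the following, assuming one row per input parameter; the database, table, and column names are made up for illustration.

import sqlite3

# Open (or create) the results database; table and column names are illustrative.
conn = sqlite3.connect('results.db')
conn.execute('CREATE TABLE IF NOT EXISTS results (param TEXT PRIMARY KEY, value TEXT)')

def save_result(param, value):
    # INSERT OR REPLACE makes re-running an item after a restart harmless.
    conn.execute('INSERT OR REPLACE INTO results (param, value) VALUES (?, ?)',
                 (param, value))
    conn.commit()

def already_done():
    # Parameters that already have a stored result can be skipped on restart.
    return {row[0] for row in conn.execute('SELECT param FROM results')}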
Source: https://stackoverflow.com/questions/53996035/dump-intermediate-results-of-multiprocessing-job-to-filesystem-and-continue-with