Why must we explicitly pass constants into multiprocessing functions?


Question


I have been working with the multiprocessing package to speed up some geoprocessing (GIS/arcpy) tasks that are repetitive and need to be done the same way for more than 2,000 similar geometries.

The splitting up works well, but my "worker" function is rather long and complicated because the task itself, from start to finish, is complicated. I would love to break the steps down further, but I am having trouble passing information to and from the worker function because, for some reason, ANYTHING that a worker function run under multiprocessing uses needs to be passed in explicitly.

This means I cannot define constants in the body of if __name__ == '__main__' and then use them in the worker function. It also means that my parameter list for the worker function is getting really long - which is super ugly, since passing more than one parameter also requires creating a helper "star" function and then using itertools to zip the arguments back together (à la the second answer on this question).

I have created a trivial example below that demonstrates what I am talking about. Are there any workarounds for this - a different approach I should be using - or can someone at least explain why this is the way it is?

Note: I am running this on Windows Server 2008 R2 Enterprise x64.

Edit: I seem to have not made my question clear enough. I am not that concerned with the fact that pool.map only takes one argument (although it is annoying), but rather I do not understand why a function defined outside of if __name__ == '__main__' cannot access things defined inside that block when it is used as a multiprocessing worker - unless you explicitly pass them in as arguments, which is obnoxious.

import os
import multiprocessing
import itertools

def loop_function(word):
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_function(word, root_dir): # <------ PROBLEM
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_star(arg_package):
    return nonloop_function(*arg_package)

# Serial version
#
# if __name__ == '__main__':
# root_dir = 'C:\\hbrowning'
# word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
# for word in word_list:
#     loop_function(word)
#
## --------------------------------------------

# Multiprocessing version
if __name__ == '__main__':
    root_dir = 'C:\\hbrowning'
    word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
    NUM_CORES = 2
    pool = multiprocessing.Pool(NUM_CORES, maxtasksperchild=1)

    results = pool.map(nonloop_star, itertools.izip(word_list, itertools.repeat(root_dir)),
                   chunksize=1)
    pool.close()
    pool.join()

Answer 1:


The problem is, at least on Windows (although there are similar caveats with the *nix "fork" style of multiprocessing, too), that when you execute your script it (to greatly simplify things) effectively ends up as if you had started two blank (shell) processes with subprocess.Popen() and then had them execute:

python -c "from your_script import nonloop_star; nonloop_star(('dog', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('cat', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('yeti', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('parakeet', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('dolphin', 'C:\\hbrowning'))"

one by one, each process picking up a new call as soon as it finishes the previous one. That means that your if __name__ == "__main__" block never gets executed in those processes (there your script is not the main script; it is imported as a module), so anything declared inside it is not available to the function (because it was never evaluated).
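To make that concrete, here is a minimal sketch (not from the question, just for illustration): on Windows, the module-level print runs in the parent and again in every spawned child, while the print inside the __main__ guard runs in the parent only.

import multiprocessing

print("module imported, __name__ is %s" % __name__)   # runs in the parent AND in each spawned child

def worker(x):
    return x * 2

if __name__ == '__main__':
    print("main guard entered")                        # runs in the parent only
    pool = multiprocessing.Pool(2)
    print(pool.map(worker, [1, 2, 3]))                 # [2, 4, 6]
    pool.close()
    pool.join()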

For the stuff outside your function you can at least cheat by accessing your module via sys.modules["your_script"] or even with globals(), but that works only for what actually got evaluated - anything placed inside the if __name__ == "__main__" guard is not available because it never had a chance to run. That is also the reason why you must use this guard on Windows - without it you would be executing your pool creation, and all the other code you nested inside the guard, over and over again in each spawned process.

If you need to share read-only data with your multiprocessing functions, just define it in the global namespace of your script, outside of the __main__ guard, and all functions will have access to it (since it gets re-evaluated whenever a new process is started), regardless of whether they are running as separate processes or not.
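For example, a minimal sketch of that approach, keeping the same C:\hbrowning directory as in the question - the constant lives at module level, so every spawned worker re-creates it when the module is re-imported:

import os
import multiprocessing

ROOT_DIR = 'C:\\hbrowning'      # module level: re-evaluated in every child process on import

def write_word(word):
    with open(os.path.join(ROOT_DIR, word + '.txt'), "w") as text_file:
        text_file.write(word + " food")

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    pool.map(write_word, ['dog', 'cat', 'llama'], chunksize=1)
    pool.close()
    pool.join()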

If you need data that changes, then you need something that can synchronize itself across processes - there is a slew of modules designed for that, but most of the time Python's own pickle-based, datagram-communicating multiprocessing.Manager (and the types it provides), albeit slow and not very flexible, is enough.
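A rough sketch of the Manager approach (the worker count_letters and its arguments are made up for illustration) - the dict proxy it returns can be passed to the pool workers and updated from each process:

import multiprocessing

def count_letters(args):
    word, shared = args            # the Manager dict proxy pickles fine as an argument
    shared[word] = len(word)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    counts = manager.dict()        # a proxy object shared between processes
    words = ['dog', 'cat', 'llama']
    pool = multiprocessing.Pool(2)
    pool.map(count_letters, [(word, counts) for word in words], chunksize=1)
    pool.close()
    pool.join()
    print(dict(counts))            # {'dog': 3, 'cat': 3, 'llama': 5}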




Answer 2:


Python » 3.6.1 Documentation: multiprocessing.pool.Pool

map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though)

There is no restriction on the items themselves - the argument only has to be an iterable!
Try a container class, for instance:

class WP(object):
    def __init__(self, name):
        self.root_dir = 'C:\\hbrowning'
        self.name = name

def wp_function(wp):  # the worker now takes a single WP instance instead of a tuple
    file_name = os.path.join(wp.root_dir, wp.name + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(wp.name + " food")

word_list = [WP('dog'), WP('cat'), WP('llama'), WP('yeti'), WP('parakeet'), WP('dolphin')]
results = pool.map(wp_function, word_list, chunksize=1)

Note: the attribute values stored in the class have to be picklable!
Read about what can be pickled and unpickled in the pickle module documentation.
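One quick way to check that up front (a small sketch reusing the WP class above) is to try pickling an instance yourself:

import pickle

pickle.dumps(WP('dog'))    # raises an error (e.g. TypeError or PicklingError) if any attribute is not picklable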



Source: https://stackoverflow.com/questions/44550502/why-must-we-explicitly-pass-constants-into-multiprocessing-functions
