问题
I'm not sure if this title is appropriate for my situation: the reason why I want to share numpy array is that it might be one of the potential solutions to my case, but if you have other solutions that would also be nice.
My task: I need to implement an iterative algorithm with multiprocessing, while each of these processes need to have a copy of data(this data is large, and read-only, and won't change during the iterative algorithm).
I've written some pseudo code to demonstrate my idea:
import multiprocessing
def worker_func(data, args):
# do sth...
return res
def compute(data, process_num, niter):
data
result = []
args = init()
for iter in range(niter):
args_chunk = split_args(args, process_num)
pool = multiprocessing.Pool()
for i in range(process_num):
result.append(pool.apply_async(worker_func,(data, args_chunk[i])))
pool.close()
pool.join()
# aggregate result and update args
for res in result:
args = update_args(res.get())
if __name__ == "__main__":
compute(data, 4, 100)
The problem is in each iteration, I have to pass the data to subprocess, which is very time-consuming.
I've come up with two potential solutions:
- share data among processes (it's ndarray), that's the title of this question.
- Keep subprocess alive, like a daemon process or something...and wait for call. By doing that, I only need to pass the data at the very beginning.
So, is there any way to share a read-only numpy array among process? Or if you have a good implementation of solution 2, it also works.
Thanks in advance.
回答1:
If you absolutely must use Python multiprocessing, then you can use Python multiprocessing along with Arrow's Plasma object store to store the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.
If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it will work out of the box not just with arrays but also with Python objects that contain arrays.
Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.
Here is a modified version of your example that runs.
import numpy as np
import ray
ray.init()
@ray.remote
def worker_func(data, i):
# Do work. This function will have read-only access to
# the data array.
return 0
data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)
# Run worker_func 10 times in parallel. This will not create any copies
# of the array. The tasks will run in separate processes.
result_ids = []
for i in range(10):
result_ids.append(worker_func.remote(data_id, i))
# Get the results.
results = ray.get(result_ids)
Note that if we omitted the line data_id = ray.put(data)
and instead called worker_func.remote(data, i)
, then the data
array would be stored in shared memory once per function call, which would be inefficient. By first calling ray.put
, we can store the object in the object store a single time.
回答2:
Conceptually for your problem, using mmap
is a standard way.
This way, the information can be retrieved from mapped memory by multiple processes
Basic understanding of mmap:
https://en.wikipedia.org/wiki/Mmap
Python has "mmap" module(import mmap
)
The documentation of python standard and some examples are in below link
https://docs.python.org/2/library/mmap.html
来源:https://stackoverflow.com/questions/54580947/python3-multiprocess-shared-numpy-arrayread-only