Question
I am trying to apply a function to 5 cross validation sets in parallel using Python's multiprocessing
and repeat that for different parameter values, like so:
import pandas as pd
import numpy as np
import multiprocessing as mp
from sklearn.model_selection import StratifiedKFold

# simulated datasets
X = pd.DataFrame(np.random.randint(2, size=(3348, 868), dtype='int8'))
y = pd.Series(np.random.randint(2, size=3348, dtype='int64'))

# dummy function to apply
def _work(args):
    del(args)

for C in np.arange(0.0, 2.0e-3, 1.0e-6):
    splitter = StratifiedKFold(n_splits=5)
    with mp.Pool(processes=5) as pool:
        pool_results = pool.map(
            func=_work,
            iterable=((C, X.iloc[train_index], X.iloc[test_index])
                      for train_index, test_index in splitter.split(X, y))
        )
However, halfway through execution I get the following error:
Traceback (most recent call last):
File "mre.py", line 19, in <module>
with mp.Pool(processes=5) as pool:
File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
context=self.get_context())
File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
self._repopulate_pool()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
w.start()
File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
I'm running this on Ubuntu 16.04 with 32 GB of memory, and watching htop
during execution it never goes over 18.5 GB, so I don't think I'm running out of memory.
It is definitely due to the splitting of my dataframes with the indexes from splitter.split(X, y),
since no error is thrown when I pass my dataframes to the Pool directly.
I saw an answer that says it might be due to too many file descriptors being created, but I have no idea how I might go about fixing that, and isn't the context manager supposed to help avoid this sort of problem?
Answer 1:
os.fork() makes a copy of a process, so if you're sitting at about 18 GB of usage and want to call fork, you need another 18 GB. Twice 18 is 36 GB, which is well over 32 GB. While this analysis is (intentionally) naive, since some things don't get copied on fork, it's probably sufficient to explain the problem.
The solution is either to make the pools earlier, when less memory needs to be copied, or to work harder at sharing the largest objects. Or, of course, add more memory (perhaps just virtual memory, i.e., swap space) to the system.
Source: https://stackoverflow.com/questions/54364064/oserror-errno-12-cannot-allocate-memory-when-using-python-multiprocessing-poo