My code runs fine with smaller test samples, like 10,000 rows of data in X_train, y_train. When I call it for millions of rows, I get a MemoryError.
As a workaround you can try to memory-map your data explicitly and manually, as explained in the joblib documentation.
Edit #1: Here is the important part:
    from sklearn.externals import joblib

    # Dump the array to disk once, then reload it as a memory map so that
    # all worker processes share the same on-disk data instead of each
    # receiving its own in-memory copy.
    joblib.dump(X_train, some_filename)
    X_train = joblib.load(some_filename, mmap_mode='r+')
Then pass this memmapped data to GridSearchCV under scikit-learn 0.15+.
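For instance, a minimal sketch of that pattern (the SVC estimator, the parameter grid, the stand-in data, and the file name are all illustrative assumptions, not from the original code):

    import numpy as np
    from sklearn.externals import joblib
    from sklearn.grid_search import GridSearchCV  # 0.15-era import path
    from sklearn.svm import SVC

    if __name__ == '__main__':  # guard required for multiprocessing on Windows
        # Stand-in data; in the question this is millions of rows.
        X_train = np.random.rand(1000, 20)
        y_train = (X_train[:, 0] > 0.5).astype(int)

        # Dump once, then reload as a memmap so each GridSearchCV worker
        # reads the same on-disk array instead of getting a pickled copy.
        joblib.dump(X_train, 'X_train.joblib')
        X_train = joblib.load('X_train.joblib', mmap_mode='r+')

        search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, n_jobs=4)
        search.fit(X_train, y_train)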
Edit #2: Furthermore, if you use the 32-bit version of Anaconda, you will be limited to 2 GB per Python process, which can also limit the available memory.
I just found a bug in numpy.save under Python 3.4, but even once that is fixed the subsequent call to mmap will fail with:
OSError: [WinError 8] Not enough storage is available to process this command
So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1 at this time).
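If you are not sure which build you are running, the pointer size of the interpreter tells you (this quick check is a convenience I am adding, not part of the original answer):

    import struct
    # prints 32 on a 32-bit interpreter, 64 on a 64-bit one
    print(struct.calcsize('P') * 8)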
Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently joblib.Parallel memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] The paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
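In the meantime, the mmap_mode argument can be passed when calling joblib.Parallel directly (GridSearchCV itself does not expose this knob, which is why the explicit memmapping workaround above is needed). A minimal sketch, where the worker function and array sizes are made up for illustration:

    import numpy as np
    from sklearn.externals.joblib import Parallel, delayed

    def column_means(data):
        return data.mean(axis=0)

    if __name__ == '__main__':  # guard required for multiprocessing on Windows
        big = np.random.rand(1000000, 10)

        # Inputs larger than max_nbytes are dumped to a temporary file and
        # reloaded in the workers with the given mmap_mode; 'r' (read-only)
        # avoids the copy-on-write 'c' default that can exhaust the Windows
        # paging file.
        results = Parallel(n_jobs=2, max_nbytes='1M', mmap_mode='r')(
            delayed(column_means)(big) for _ in range(4))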