scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

攒了一身酷 2021-01-02 18:33

My code runs fine with smaller test samples, like 10,000 rows of data in X_train and y_train. When I call it for millions of rows, I get the error from the title, raised from the multiprocessing pool: self.value out of range for 'i' format code.

1 Answer
  • 2021-01-02 18:46

    As a workaround, you can try to memory-map your data explicitly and manually, as explained in the joblib documentation.

    Edit #1: Here is the important part:

    from sklearn.externals import joblib   # on modern scikit-learn, use `import joblib` instead

    some_filename = 'X_train.joblib'  # illustrative path; pick a disk with enough free space
    joblib.dump(X_train, some_filename)                    # serialize the large array once
    X_train = joblib.load(some_filename, mmap_mode='r+')   # reload it as a writable memory map

    Then pass this memory-mapped data to GridSearchCV under scikit-learn 0.15+; the worker processes will read from the shared memory map instead of receiving pickled copies of the arrays.
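
    For illustration, here is a minimal sketch of that step. The estimator, parameter grid, and file names are hypothetical, and the GridSearchCV import path shown is the modern one (on 0.15 it lived in sklearn.grid_search):

    from sklearn.externals import joblib              # plain `import joblib` on recent versions
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search on 0.15
    from sklearn.svm import SVC

    X_train = joblib.load('X_train.joblib', mmap_mode='r+')
    y_train = joblib.load('y_train.joblib', mmap_mode='r+')

    # Each of the n_jobs worker processes receives a reference to the
    # on-disk memory map instead of a full pickled copy of the arrays.
    search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, n_jobs=4)
    search.fit(X_train, y_train)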

    Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process is limited to 2 GB of address space, which can itself exhaust memory on large datasets.
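
    If you are not sure which build you are running, a quick standard-library check is:

    import struct
    import sys

    print(struct.calcsize('P') * 8)   # pointer size in bits: 32 or 64
    print(sys.maxsize > 2 ** 32)      # True on a 64-bit build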

    I just found a bug in numpy.save under Python 3.4, but even with that fixed, the subsequent call to mmap will fail with:

    OSError: [WinError 8] Not enough storage is available to process this command
    

    So please use a 64-bit version of Python (with Anaconda, as AFAIK nobody else currently provides 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).

    Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently, joblib.Parallel memory-maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file, sometimes triggering "[error 1455] The paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib; a sketch of overriding it yourself follows below.
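
    If you do not want to wait for a new joblib release, Parallel accepts mmap_mode and max_nbytes directly, so you can override the default in your own code. A minimal sketch with a toy workload (the array and function are illustrative):

    import numpy as np
    from joblib import Parallel, delayed

    X = np.random.rand(2000000, 10)   # stand-in for a large training array

    # max_nbytes is the size threshold above which inputs are dumped to a
    # temporary file and memory-mapped in the workers; mmap_mode='r'
    # replaces the copy-on-write 'c' default discussed above.
    results = Parallel(n_jobs=4, max_nbytes='1M', mmap_mode='r')(
        delayed(np.mean)(X[i::4]) for i in range(4)
    )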
