scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

攒了一身酷 2021-01-02 18:33

My code runs fine with smaller test samples, like 10,000 rows of data in X_train and y_train. When I call it for millions of rows, I get the error from the title, raised from the multiprocessing pool: self.value out of range for 'i' format code.

1 Answer
  • 2021-01-02 18:46

    As a workaround, you can try to memory-map your data explicitly and manually, as explained in the joblib documentation.

    Edit #1: Here is the important part:

    from sklearn.externals import joblib   # on modern scikit-learn, use `import joblib` instead

    some_filename = 'X_train.joblib'  # illustrative path; pick a disk with enough free space
    joblib.dump(X_train, some_filename)                    # serialize the large array once
    X_train = joblib.load(some_filename, mmap_mode='r+')   # reload it as a writable memory map

    Then pass this memory-mapped data to GridSearchCV under scikit-learn 0.15+; the worker processes will read from the shared memory map instead of receiving pickled copies of the arrays.
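
    For illustration, here is a minimal sketch of that step. The estimator, parameter grid, and file names are hypothetical, and the GridSearchCV import path shown is the modern one (on 0.15 it lived in sklearn.grid_search):

    from sklearn.externals import joblib              # plain `import joblib` on recent versions
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search on 0.15
    from sklearn.svm import SVC

    X_train = joblib.load('X_train.joblib', mmap_mode='r+')
    y_train = joblib.load('y_train.joblib', mmap_mode='r+')

    # Each of the n_jobs worker processes receives a reference to the
    # on-disk memory map instead of a full pickled copy of the arrays.
    search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, n_jobs=4)
    search.fit(X_train, y_train)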

    Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process is limited to 2 GB of address space, which can itself exhaust memory on large datasets.
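
    If you are not sure which build you are running, a quick standard-library check is:

    import struct
    import sys

    print(struct.calcsize('P') * 8)   # pointer size in bits: 32 or 64
    print(sys.maxsize > 2 ** 32)      # True on a 64-bit build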

    I just found a bug in numpy.save under Python 3.4, but even with that fixed, the subsequent call to mmap will fail with:

    OSError: [WinError 8] Not enough storage is available to process this command
    

    So please use a 64-bit version of Python (with Anaconda, as AFAIK nobody else currently provides 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).

    Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently, joblib.Parallel memory-maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file, sometimes triggering "[error 1455] The paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib; a sketch of overriding it yourself follows below.
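
    If you do not want to wait for a new joblib release, Parallel accepts mmap_mode and max_nbytes directly, so you can override the default in your own code. A minimal sketch with a toy workload (the array and function are illustrative):

    import numpy as np
    from joblib import Parallel, delayed

    X = np.random.rand(2000000, 10)   # stand-in for a large training array

    # max_nbytes is the size threshold above which inputs are dumped to a
    # temporary file and memory-mapped in the workers; mmap_mode='r'
    # replaces the copy-on-write 'c' default discussed above.
    results = Parallel(n_jobs=4, max_nbytes='1M', mmap_mode='r')(
        delayed(np.mean)(X[i::4]) for i in range(4)
    )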
