Question
I have the following code, which I am trying to parallelize:
import numpy as np
from joblib import Parallel, delayed
lst = [[0.0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr = np.array(lst)
w, v = np.linalg.eigh(arr)
def proj_func(i):
    return np.dot(v[:,i].reshape(-1, 1), v[:,i].reshape(1, -1))
proj = Parallel(n_jobs=-1)(delayed(proj_func)(i) for i in range(len(w)))
proj returns a really large list and it's causing memory issues.
Is there a way I could work around this?
I had thought about returning a generator rather than a list, but I don't know how to do this. Any other ways would be welcomed too.
Answer 1:
Q : "Is there a way I could work around this?"
That very much depends on what this actually stands for.
Pre-conditions, set for a fair use of the np.linalg.eigh()-method, were accidentally not met in the MCVE code snippet posted above (the input matrix is not symmetric / Hermitian), yet that remains outside the scope of this post. If any complex inputs and results were to get processed accordingly, some of the here-referred N-scaled RAM-allocations would, for obvious reasons, actually become 2*N-sized or 4*N*N-sized or 8*N*N*N-sized in the below-depicted scaling of the RAM-footprint requirements, yet the core message ought to be clear and sound from the plain N-factored sizing dependencies used below.
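As a quick side-note on that unmet precondition, a minimal check (the symmetrisation shown is only one illustrative way to obtain a valid eigh() input, not necessarily the right fix for the original data):
>>> import numpy as np
>>> arr = np.array([[0.0, 1, 2], [3, 4, 5], [6, 7, 8]])
>>> np.allclose(arr, arr.T)      # np.linalg.eigh() assumes a symmetric / Hermitian input
False
>>> sym = (arr + arr.T) / 2      # one arbitrary way to obtain a symmetric matrix, for illustration
>>> np.allclose(sym, sym.T)
True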
Is The MEMORY Sizing The Bottleneck ?
Space for static-sized data :
Given your MCVE, as was posted above, the MEMORY-sizing depends on N = arr.size, and your system has to have at least:
- N * 3 * 8 [B] of RAM for holding lst, arr, w
- N * N * 8 [B] of RAM for holding v
Put altogether, there will have to be way more than <_nCPUs_> * 8 * N * ( 3 + N ) [B] of RAM-space, just to introduce n_jobs == -1 full copies of the python interpreter process (definitely so for MacOS / WinOS and most probably also for Linux, as the fork-method was documented in 2019/2020 to yield unstable / unsafe results), before the code even tries to do the first call to proj_func( i ).
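A back-of-the-envelope check of just this static part (a sketch only, assuming 8 [B] per float64 element and <_nCPUs_> full process-replicas, exactly as estimated above; static_ram_B is a hypothetical helper name):
>>> def static_ram_B(N, nCPUs):              # mirrors the estimate above
...     return nCPUs * 8 * N * (3 + N)       # = replicas * 8 [B] * ( 3*N for lst, arr, w  +  N*N for v )
...
>>> static_ram_B(N=1E3, nCPUs=8) / 1E9       # [GB]
0.064192
>>> static_ram_B(N=1E5, nCPUs=8) / 1E9       # [GB]
640.0192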
If that is not within the capacity of your system, you may as well stop reading right here.
Next ?
Space for dynamic data :
Each of the next N calls to proj_func( i ) adds an additional RAM-allocation of N * N * 8 [B] of RAM-space for holding the np.dot()-result.
Altogether, that is more than k * N * N * N * 8 [B] of RAM for holding the np.dot()-results, where k >> 2, as each of these N results has to get SER-packed (again allocating some RAM-space for doing that), next each such SER-ed payload has to get transmitted from a remote joblib.Parallel()(delayed()(...))-executor forward to the main process (here again allocating some RAM-space for the SER-ed payload), next this RAM-stored intermediate binary payload has to get DES-erialised (so again allocating some additional RAM-space for storing the DES-ed data of the original size N * N * 8 [B]), so as to get this SER/DES-pipelined product finally appended, N times, to the initial proj == [], as the above-specified syntax of using the joblib.Parallel(…)( delayed( proj_func )( i ) for i in range( len( w ) ) )-clause insists on and imperatively enforces.
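For a feel of how large each single SER-ed payload already is, a minimal sketch (plain pickle shown for illustration only; joblib's own pickler differs in details, but not in the order of magnitude):
>>> import pickle, numpy as np
>>> N = 1000
>>> one_result = np.zeros((N, N))                # the shape of a single np.dot()-result
>>> len(pickle.dumps(one_result)) >= N * N * 8   # at least the raw 8 [B] per element must travel
True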
Put together, the RAM-footprint thus sums up to roughly:
   <_nCPUs_> * 8 * N * ( 3 + N )    // static     storage: data + all python process-replicas
 + <_nCPUs_> * 8 * N * N * k        // dynamic    storage: SER/DES on joblib.Parallel()(delayed…)
 +             8 * N * N * N        // collective storage: proj-collected N-( np.dot() )-results
~= 8 * N * ( N * N + <_nCPUs_> * ( 3 + N * ( k + 1 ) ) )
RESUMÉ :
This soon scales (even if we assume no other imports and no other static data inside the python processes) well above an "ordinary" host computing device's RAM-footprint for any N == arr.size >= 1E3:
>>> nCPUs = 4; k = 2.1; [ ( 8 * N * ( N * N + nCPUs * (3+N*(k+1)))/1E9 ) for N in ( 1E3, 1E4, 1E5, 1E6 ) ]
[8.099296, 8009.92096, 8000992.0096, 8000099200.096]
>>> nCPUs = 8; k = 2.1; [ ( 8 * N * ( N * N + nCPUs * (3+N*(k+1)))/1E9 ) for N in ( 1E3, 1E4, 1E5, 1E6 ) ]
[8.198592, 8019.84192, 8001984.0192, 8000198400.192]
>>> nCPUs = 16; k = 2.1; [ ( 8 * N * ( N * N + nCPUs * (3+N*(k+1)))/1E9 ) for N in ( 1E3, 1E4, 1E5, 1E6 ) ]
[8.397184, 8039.68384, 8003968.0384, 8000396800.384]
i.e. roughly 8 [GB], 8 [TB], 8 [PB] and 8 [EB] for N = 1E3, 1E4, 1E5 and 1E6, respectively.
EPILOGUE :
So a single SLOC, using a syntax as easy as that of joblib.Parallel()(delayed()()), can immediately devastate, in one unsalvageable manner, the whole of the so-far performed efforts of the computing graph, if a proper design effort was not spent on at least a raw, quantitative estimate of the data-processing involved.
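If the downstream processing can consume the projections from a disk-backed array, one possible way to avoid collecting N huge returned objects is to pre-allocate a np.memmap and let each worker write its block in place, so no N * N result travels back through the SER/DES-pipeline. This is a sketch only, under stated assumptions: the file name and the symmetrised input are placeholders, and the exact memmap-sharing semantics depend on the joblib backend and version.
import numpy as np
from joblib import Parallel, delayed

arr = np.array([[0.0, 1, 2], [1.0, 4, 5], [2.0, 5, 8]])   # placeholder symmetric input for eigh
w, v = np.linalg.eigh(arr)
n = len(w)

# pre-allocated, disk-backed output: n blocks of n x n; the workers return nothing large
proj = np.memmap("proj_blocks.dat", dtype=v.dtype, mode="w+", shape=(n, n, n))

def proj_func(i, out):
    out[i, :, :] = np.outer(v[:, i], v[:, i])              # write in place, return None

Parallel(n_jobs=-1)(delayed(proj_func)(i, proj) for i in range(n))
proj.flush()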
Source: https://stackoverflow.com/questions/60691062/how-to-handle-really-large-objects-returned-from-the-joblib-parallel