Question
My original list_
array has over 2 million entries, and I get a memory error when I run the code that calculates the rolling standard deviation. Is there a way I could work around it? The list_
down below is a portion of the actual numpy array.
Pandas data:
import pandas as pd
import numpy as np

bigdata = 'input.csv'
data = pd.read_csv(bigdata, low_memory=False)
# reverses all the table data values
data1 = data.iloc[::-1].reset_index(drop=True)
list_ = np.array(data1['Close'])
Code:
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
def rolling_window(a, window):
    # Strided (n - window + 1, window) view over `a`; no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

std = np.std(rolling_window(list_, number), axis=1)
Error Message:
MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Full length of the error message:
MemoryError Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
5 return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
6
----> 7 std1 = np.std(rolling_window(PC_list, number), axis=1)
<__array_function__ internals> in std(*args, **kwargs)
C:\Python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
3495
3496 return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3497 **kwargs)
3498
3499
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
232 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
233 ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 234 keepdims=keepdims)
235
236 if isinstance(ret, mu.ndarray):
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
200 # Note that x may not be inexact and that we need it to be an array,
201 # not a scalar.
--> 202 x = asanyarray(arr - arrmean)
203
204 if issubclass(arr.dtype.type, (nt.floating, nt.integer)):
MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Answer 1:
Do us the favor of referencing your previous related questions (at least 2). I happened to recall seeing something similar, so I looked up your previous questions.
Also, when asking about an error, show the full traceback (if possible). It helps us (and you) identify where the problem occurs, and narrows down the possible reasons and fixes.
With the sample list_
(why such a bad name for a numpy array?) of only (30,) shape, the rolling_window
array isn't that large. Plus it's a view:
In [90]: x = rolling_window(list_, number)
In [91]: x.shape
Out[91]: (26, 5)
However an operation on this array might produce a copy, boosting memory use.
In [96]: np.std(x, axis=1)
Out[96]: array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469, 12.54834004, ... 8.07511323])
In [97]: _.shape
Out[97]: (26,)
np.std computes:
std = sqrt(mean(abs(x - x.mean())**2))
x.mean(axis=1)
is one value per row, but the centering step
In [102]: x.mean(axis=1).shape
Out[102]: (26,)
In [103]: (x - x.mean(axis=1, keepdims=True)).shape
Out[103]: (26, 5)
In [106]: (abs(x - x.mean(axis=1, keepdims=True))**2).shape
Out[106]: (26, 5)
produces an array as big as x, and it will be a full copy, not a strided virtual copy.
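To see this concretely (a small sketch of my own, not from the original session; the array sizes are illustrative): the strided view itself costs no extra memory, but centering materializes a real array of the full windowed shape.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12.0)                              # 12 float64 values, strides (8,)
x = as_strided(a, shape=(8, 5), strides=(8, 8))  # 8 overlapping windows of length 5

# The view shares its buffer with `a`: no new data was allocated.
assert x.base is not None

# Centering materializes a full (8, 5) float64 array: 8 * 5 * 8 = 320 bytes.
diff = x - x.mean(axis=1, keepdims=True)
print(diff.nbytes)  # 320
```

Scale those 320 bytes up to a (2659448, 10000) result and you get the ~200 GiB allocation in the traceback.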
Does the error message shape, (2659448, 10000), make sense? Is your window size 10000, and is the expected number of windows the other value?
198. GiB is a reasonable number given those dimensions:
In [94]: 2659448*10000*8/1e9
Out[94]: 212.75584
I'm not going to test your code with an array large enough to produce a memory error.
as_strided is a nice, fast way of generating moving windows, but it easily blows up the memory usage.
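One way to avoid the big intermediate entirely (a sketch of my own, not part of the original answer) is to build each window's mean and variance from cumulative sums, which needs only O(n) extra memory. Note that cumulative sums can lose precision on very long arrays of large values; pandas' Series.rolling(window).std() is a robust off-the-shelf alternative.

```python
import numpy as np

def rolling_std(a, window):
    """Population std (ddof=0, matching np.std) of each length-`window`
    slice of `a`, without materializing an (n_windows, window) array."""
    a = np.asarray(a, dtype=np.float64)
    c1 = np.concatenate(([0.0], np.cumsum(a)))      # running sums
    c2 = np.concatenate(([0.0], np.cumsum(a * a)))  # running sums of squares
    s1 = c1[window:] - c1[:-window]                 # per-window sum
    s2 = c2[window:] - c2[:-window]                 # per-window sum of squares
    var = s2 / window - (s1 / window) ** 2
    return np.sqrt(np.maximum(var, 0.0))            # clamp tiny negative rounding
```

The result has the same shape and values as np.std(rolling_window(a, window), axis=1), but the extra memory is a few length-n vectors instead of an n-by-window matrix.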
Answer 2:
Generally, there are two ways to deal with "cannot allocate 198GiB of memory":
Process the data in chunks, or line by line.
Your algorithm appears to be suitable for this; rather than reading the data all at once, rewrite the rolling_window
function so that it loads the initial window (the first n
lines of the file), then repeatedly drops one line and reads one line from the file. That way, you'll never have more than n
lines in memory and it'll all work fine.
If it's a local file, it can be kept open during the whole calculation, which is easiest. If it's a remote object, you may find connections timing out; if so, you may need to either copy the data to a local file, or use the relevant seek/offset parameter to reopen the file for each additional line (or each additional chunk, which you buffer locally).
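A minimal sketch of that idea, assuming a local CSV with a Close column (the function and parameter names are illustrative, not from the question):

```python
import csv
import math
from collections import deque

def streaming_rolling_std(path, column, window):
    """Yield the population std of each full window, holding at most
    `window` values in memory at any time."""
    buf = deque(maxlen=window)            # the oldest value drops off automatically
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            buf.append(float(row[column]))
            if len(buf) == window:
                m = sum(buf) / window
                yield math.sqrt(sum((v - m) ** 2 for v in buf) / window)
```

For very long runs, a Welford-style running update would avoid re-summing the window at every step, but the version above is the simplest faithful translation of the drop-one-read-one scheme.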
Alternately, buy (or rent) a machine with more than 200 GiB of memory; machines with over 1 TiB of memory are available off the shelf at AWS (and presumably GCP and Azure), or for direct purchase.
This is especially suitable if you're reasonably sure your requirements won't grow further and you just need to get this one job done. It'll save you rewriting your code, but it's not a sustainable solution in the longer term.
Source: https://stackoverflow.com/questions/65768068/memory-error-utilizing-numpy-arrays-python