Question
My original list_
array has over 2 million entries, and I get a memory error when I run the code that calculates the rolling standard deviation. Is there a way I could work around it? The list_
down below is a portion of the actual numpy array.
Pandas data:
import pandas as pd
import numpy as np

bigdata = 'input.csv'
data = pd.read_csv(bigdata, low_memory=False)
# reverses all the table data values
data1 = data.iloc[::-1].reset_index(drop=True)
list_ = np.array(data1['Close'])
Code:
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
def rolling_window(a, window):
    # Strided (n - window + 1, window) view over `a`; no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

std = np.std(rolling_window(list_, number), axis=1)
Error Message:
MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Full length of the error message:
MemoryError Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
5 return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
6
----> 7 std1 = np.std(rolling_window(PC_list, number), axis=1)
<__array_function__ internals> in std(*args, **kwargs)
C:\Python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
3495
3496 return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3497 **kwargs)
3498
3499
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
232 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
233 ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 234 keepdims=keepdims)
235
236 if isinstance(ret, mu.ndarray):
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
200 # Note that x may not be inexact and that we need it to be an array,
201 # not a scalar.
--> 202 x = asanyarray(arr - arrmean)
203
204 if issubclass(arr.dtype.type, (nt.floating, nt.integer)):
MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Answer 1:
Do us the favor of referencing your previous related questions (at least 2). I happened to recall seeing something similar, so I looked up your previous questions.
Also, when asking about an error, show the full traceback (if possible). It helps us (and you) identify where the problem occurs, and narrows down the possible reasons and fixes.
With the sample list_
(why such a bad name for a numpy array?) of only (30,) shape, the rolling_window
array isn't that large. Plus it's a view:
In [90]: x = rolling_window(list_, number)
In [91]: x.shape
Out[91]: (26, 5)
However an operation on this array might produce a copy, boosting memory use.
In [96]: np.std(x, axis=1)
Out[96]: array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469, 12.54834004, ... 8.07511323])
In [97]: _.shape
Out[97]: (26,)
np.std computes:
std = sqrt(mean(abs(x - x.mean())**2))
x.mean(axis=1)
is one value per row, but the centering step
In [102]: x.mean(axis=1).shape
Out[102]: (26,)
In [103]: (x - x.mean(axis=1, keepdims=True)).shape
Out[103]: (26, 5)
In [106]: (abs(x - x.mean(axis=1, keepdims=True))**2).shape
Out[106]: (26, 5)
produces an array as big as x, and it will be a full copy, not a strided virtual copy.
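To see this concretely (a small sketch of my own, not from the original session; the array sizes are illustrative): the strided view itself costs no extra memory, but centering materializes a real array of the full windowed shape.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12.0)                              # 12 float64 values, strides (8,)
x = as_strided(a, shape=(8, 5), strides=(8, 8))  # 8 overlapping windows of length 5

# The view shares its buffer with `a`: no new data was allocated.
assert x.base is not None

# Centering materializes a full (8, 5) float64 array: 8 * 5 * 8 = 320 bytes.
diff = x - x.mean(axis=1, keepdims=True)
print(diff.nbytes)  # 320
```

Scale those 320 bytes up to a (2659448, 10000) result and you get the ~200 GiB allocation in the traceback.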
Does the error message shape, (2659448, 10000), make sense? Is your window size 10000, and is the expected number of windows the other value?
198. GiB is a reasonable number given those dimensions:
In [94]: 2659448*10000*8/1e9
Out[94]: 212.75584
I'm not going to test your code with an array large enough to produce a memory error.
as_strided is a nice, fast way of generating moving windows, but it easily blows up the memory usage.
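One way to avoid the big intermediate entirely (a sketch of my own, not part of the original answer) is to build each window's mean and variance from cumulative sums, which needs only O(n) extra memory. Note that cumulative sums can lose precision on very long arrays of large values; pandas' Series.rolling(window).std() is a robust off-the-shelf alternative.

```python
import numpy as np

def rolling_std(a, window):
    """Population std (ddof=0, matching np.std) of each length-`window`
    slice of `a`, without materializing an (n_windows, window) array."""
    a = np.asarray(a, dtype=np.float64)
    c1 = np.concatenate(([0.0], np.cumsum(a)))      # running sums
    c2 = np.concatenate(([0.0], np.cumsum(a * a)))  # running sums of squares
    s1 = c1[window:] - c1[:-window]                 # per-window sum
    s2 = c2[window:] - c2[:-window]                 # per-window sum of squares
    var = s2 / window - (s1 / window) ** 2
    return np.sqrt(np.maximum(var, 0.0))            # clamp tiny negative rounding
```

The result has the same shape and values as np.std(rolling_window(a, window), axis=1), but the extra memory is a few length-n vectors instead of an n-by-window matrix.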
Answer 2:
Generally, there are two ways to deal with "cannot allocate 198GiB of memory":
Process the data in chunks, or line by line.
Your algorithm appears to be suitable for this; rather than reading the data all at once, rewrite the rolling_window
function so that it loads the initial window (the first n
lines of the file), then repeatedly drops one line and reads one line from the file. That way, you'll never have more than n
lines in memory and it'll all work fine.
If it's a local file, it can be kept open during the whole calculation, which is easiest. If it's a remote object, you may find connections timing out; if so, you may need to either copy the data to a local file, or use the relevant seek/offset parameter to reopen the file for each additional line (or each additional chunk, which you buffer locally).
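A minimal sketch of that idea, assuming a local CSV with a Close column (the function and parameter names are illustrative, not from the question):

```python
import csv
import math
from collections import deque

def streaming_rolling_std(path, column, window):
    """Yield the population std of each full window, holding at most
    `window` values in memory at any time."""
    buf = deque(maxlen=window)            # the oldest value drops off automatically
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            buf.append(float(row[column]))
            if len(buf) == window:
                m = sum(buf) / window
                yield math.sqrt(sum((v - m) ** 2 for v in buf) / window)
```

For very long runs, a Welford-style running update would avoid re-summing the window at every step, but the version above is the simplest faithful translation of the drop-one-read-one scheme.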
Alternately, buy (or rent) a machine with more than 200 GiB of memory; machines with over 1 TiB of memory are available off the shelf at AWS (and presumably GCP and Azure), or for direct purchase.
This is especially suitable if you're reasonably sure your requirements won't grow further and you just need to get this one job done. It'll save you rewriting your code, but it's not a sustainable solution in the longer term.
Source: https://stackoverflow.com/questions/65768068/memory-error-utilizing-numpy-arrays-python