Data size in memory vs. on disk

纵饮孤独 submitted on 2019-11-27 04:52:31

Python Object Data Size

If the data is stored in a Python object, some extra bookkeeping data is kept in memory alongside the actual data.

This is easy to test.

It is interesting to note that the overhead of the Python object is significant for small data at first, but quickly becomes negligible as the data grows.
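As a quick check of that trend for one container type, the sketch below compares sys.getsizeof of an array.array of doubles with its raw payload size. The helper name overhead_ratio is made up for this illustration, and the exact numbers depend on the CPython version and platform.

```python
import array
import sys

def overhead_ratio(n):
    """Return getsizeof(array) divided by the raw payload size for n doubles."""
    data = array.array('d', [0.0] * n)
    raw_bytes = len(data) * data.itemsize
    return sys.getsizeof(data) / raw_bytes

# The ratio should approach 1.0 as n grows, because the fixed
# object header is spread over more and more payload bytes.
for n in (1, 10, 100, 10000):
    print(n, round(overhead_ratio(n), 3))
```

Note that sys.getsizeof reports only the container itself. For a list, set, or tuple it does not include the float objects that the container points to.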

Here is the IPython code used to generate the plot:

%matplotlib inline
import random
import sys
import array
import matplotlib.pyplot as plt

max_doubles = 10000

raw_size = []
array_size = []
string_size = []
list_size = []
set_size = []
tuple_size = []
size_range = range(max_doubles)

# test double size
# (Python 3: range replaces xrange, and tobytes replaces the removed tostring)
for n in size_range:
    double_array = array.array('d', [random.random() for _ in range(n)])
    double_string = double_array.tobytes()
    double_list = double_array.tolist()
    double_set = set(double_list)
    double_tuple = tuple(double_list)

    raw_size.append(double_array.buffer_info()[1] * double_array.itemsize)
    array_size.append(sys.getsizeof(double_array))
    string_size.append(sys.getsizeof(double_string))
    list_size.append(sys.getsizeof(double_list))
    set_size.append(sys.getsizeof(double_set))
    tuple_size.append(sys.getsizeof(double_tuple))

# display
plt.figure(figsize=(10,8))
plt.title('The size of data in various forms', fontsize=20)
plt.xlabel('Data Size (double, 8 bytes)', fontsize=15)
plt.ylabel('Memory Size (bytes)', fontsize=15)
plt.loglog(
    size_range, raw_size, 
    size_range, array_size, 
    size_range, string_size,
    size_range, list_size,
    size_range, set_size,
    size_range, tuple_size
)
plt.legend(['Raw (Disk)', 'Array', 'String', 'List', 'Set', 'Tuple'], fontsize=15, loc='best')

In a plain Python list, every double-precision number requires at least 32 bytes of memory on a 64-bit build, but only 8 of those bytes store the actual number. The rest is needed to support the dynamic nature of Python.

The float object used in CPython is defined in floatobject.h:

typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

where PyObject_HEAD is a macro that expands to the PyObject struct:

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Therefore, every floating point object in Python stores two pointer-sized fields (so each takes 8 bytes on a 64-bit architecture) besides the 8-byte double, giving 24 bytes of heap-allocated memory per number. This is confirmed by sys.getsizeof(1.0) == 24.
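The same arithmetic can be checked from Python itself. This is a minimal sketch that assumes a standard CPython build; the variable names are illustrative, and other builds or implementations may report different sizes.

```python
import struct
import sys

float_size = sys.getsizeof(1.0)       # whole float object, header included
pointer_size = struct.calcsize("P")   # size of a C pointer on this platform
payload_size = struct.calcsize("d")   # size of a C double
header_size = float_size - payload_size

# On a typical 64-bit CPython build the header is two pointer-sized
# fields (ob_refcnt and ob_type), matching the structs shown above.
print("float object:", float_size, "bytes")
print("header:", header_size, "bytes; payload:", payload_size, "bytes")
print("pointer size:", pointer_size, "bytes")
```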

This means that a list of n doubles in Python takes at least 8*n bytes of memory just to store the pointers (PyObject*) to the number objects, and each number object requires an additional 24 bytes, for a total of at least 32 bytes per number. To test it, try running the following lines in the Python REPL:

>>> import math
>>> list_of_doubles = [math.sin(x) for x in range(10*1000*1000)]

and check the memory usage of the Python interpreter (I got around 350 MB of allocated memory on my x86-64 computer). Note that if you tried:

>>> list_of_doubles = [1.0 for __ in range(10*1000*1000)]

you would obtain just about 80 MB, because all elements in the list refer to the same instance of the floating point number 1.0.
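The difference between the two lists can also be estimated without watching the process memory. The sketch below adds the size of the list to the size of each distinct element object, counted once. The helper estimated_list_bytes is hypothetical and ignores allocator overhead, so it approximates the figures above from below rather than reproducing them.

```python
import struct
import sys

def estimated_list_bytes(values):
    """Size of the list plus each distinct element object, counted once."""
    unique_objects = {id(v): v for v in values}
    element_bytes = sum(sys.getsizeof(v) for v in unique_objects.values())
    return sys.getsizeof(values) + element_bytes

n = 100000
# Every element is a separately allocated float object.
distinct = [float(i) + 0.5 for i in range(n)]
# Every slot points at the same float object (the constant 1.0).
shared = [1.0 for _ in range(n)]

distinct_bytes = estimated_list_bytes(distinct)
shared_bytes = estimated_list_bytes(shared)
print("distinct floats:", distinct_bytes / n, "bytes per element")
print("shared float:   ", shared_bytes / n, "bytes per element")
```

On a 64-bit CPython build, the first figure should come out at 32 bytes per element or slightly more, and the second at a little over 8.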
