Efficient way to hold and process a big dict in memory in Python

梦谈多话 2021-01-11 18:20

In a quick test, a Python dict of int => int (distinct values) with 30 million items easily eats more than 2 GB of memory on my Mac. Since I only work with int-to-int dicts, is there a

4 answers
  • 2021-01-11 18:30

    Another answer, added in case what you want is just a dictionary-like counter that's easy to use.

    High performance Counter object from Python standard library
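A minimal sketch of what using the standard-library `collections.Counter` looks like, assuming integer keys as in the question. Note that `Counter` is a `dict` subclass, so it is convenient but does not by itself reduce memory use; the sample keys are made up for illustration.

```python
from collections import Counter

# Count occurrences of integer keys; missing keys read as 0.
counts = Counter()
for key in [3, 1, 3, 7, 3, 1]:   # illustrative data only
    counts[key] += 1

print(counts[3])              # 3
print(counts[42])             # 0 (no KeyError, and the key is not inserted)
print(counts.most_common(1))  # [(3, 3)]
```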

  • 2021-01-11 18:40

    If we knew a bit more about how it would be used, it might be easier to suggest good solutions. You say you want to fetch values by key and iterate over all of them, but you say nothing about whether you need to insert or delete data.

    One pretty efficient way of storing data is with the array module. If you do not need to insert/remove data, you could simply have two arrays. The "key" array would be sorted and you could do binary search for the right key. Then you'd just pick the value from the same position in the other array.

    You could easily encapsulate that in a class that behaves dict-like. I don't know if there is a ready solution for this somewhere, but it should not be terribly difficult to implement. That should help you avoid having lots of python objects which consume memory.
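The two-array idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `SortedIntMap` is hypothetical, keys are assumed to be unique and to fit in signed 64-bit integers, and the mapping is read-only after construction.

```python
from array import array
from bisect import bisect_left


class SortedIntMap:
    """Read-only int -> int mapping backed by two parallel arrays."""

    def __init__(self, pairs):
        # Sort once by key; afterwards every lookup is a binary search.
        # sorted() builds a temporary list of tuples, so a real loader
        # would rather fill the arrays directly from pre-sorted input.
        items = sorted(pairs)
        self._keys = array('q', (k for k, _ in items))    # signed 64-bit
        self._values = array('q', (v for _, v in items))

    def _index(self, key):
        # Position of key in the sorted key array, or -1 if absent.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        return -1

    def __getitem__(self, key):
        i = self._index(key)
        if i < 0:
            raise KeyError(key)
        return self._values[i]

    def __contains__(self, key):
        return self._index(key) >= 0

    def __len__(self):
        return len(self._keys)

    def items(self):
        # Iterate over (key, value) pairs in key order.
        return zip(self._keys, self._values)


m = SortedIntMap([(5, 50), (1, 10), (3, 30)])
print(m[3])             # 30
print(4 in m)           # False
print(list(m.items()))  # [(1, 10), (3, 30), (5, 50)]
```

Lookups cost O(log n) instead of a dict's O(1), which is the price paid for storing 16 bytes per entry and no per-item Python objects.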

    But you might have other requirements that make such a solution impractical or impossible.

  • 2021-01-11 18:48

    There are at least two possibilities:

    arrays

    You could try using two arrays: one for the keys and one for the values, so that index(key) == index(value).

    Updated 2017-01-05: use 4-byte integers in array.

    An array would use less memory. On a 64-bit FreeBSD machine with python compiled with clang, an array of 30 million integers uses around 117 MiB.

    These are the python commands I used:

    Python 2.7.13 (default, Dec 28 2016, 20:51:25) 
    [GCC 4.2.1 Compatible FreeBSD Clang 3.8.0 (tags/RELEASE_380/final 262564)] on freebsd11
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from array import array
    >>> a = array('i', xrange(30000000))
    >>> a.itemsize
    4
    

    After importing array, ps reports:

    USER     PID %CPU %MEM   VSZ  RSS TT  STAT STARTED    TIME COMMAND
     rsmith 81023  0.0  0.2  35480   8100  0  I+   20:35     0:00.03 python (python2.7)
    

    After making the array:

    USER     PID %CPU %MEM    VSZ    RSS TT  STAT STARTED    TIME COMMAND
    rsmith 81023 29.0  3.1 168600 128776  0  S+   20:35     0:04.52 python (python2.7)
    

    The Resident Set Size is reported in 1 KiB units, so (128776 - 8100)/1024 = 117 MiB

    With list comprehensions you could easily get a list of indices where the key meets a certain condition. You can then use the indices in that list to access the corresponding values...
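As a small sketch of that list-comprehension approach (written for Python 3, with made-up sample data and the hypothetical names `keys`, `values`, `hits` and `selected`):

```python
from array import array

# Parallel arrays: keys[i] belongs to values[i].
keys = array('i', [10, 25, 31, 48, 52])
values = array('i', [100, 250, 310, 480, 520])

# Indices of all keys that satisfy a condition (here: even keys).
hits = [i for i, k in enumerate(keys) if k % 2 == 0]

# Use those indices to pull the matching values from the other array.
selected = [values[i] for i in hits]

print(hits)      # [0, 3, 4]
print(selected)  # [100, 480, 520]
```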

    numpy

    If you have numpy available, using it is faster, offers many more features and uses slightly less RAM:

    Python 2.7.5 (default, Jun 10 2013, 19:54:11) 
    [GCC 4.2.1 Compatible FreeBSD Clang 3.1 ((branches/release_31 156863))] on freebsd9
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> a = np.arange(0, 30000000, dtype=np.int32)
    

    From ps: 6700 KiB after starting Python, 17400 KiB after import numpy and 134824 KiB after creating the array. That's around 114 MiB.

    Furthermore, numpy supports record (structured) arrays:

    Python 2.7.5 (default, Jun 10 2013, 19:54:11) 
    [GCC 4.2.1 Compatible FreeBSD Clang 3.1 ((branches/release_31 156863))] on freebsd9
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> a = np.zeros((10,), dtype=('i4,i4'))
    >>> a
    array([(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
           (0, 0), (0, 0)], 
          dtype=[('f0', '<i4'), ('f1', '<i4')])
    >>> a.dtype.names
    ('f0', 'f1')
    >>> a.dtype.names = ('key', 'value')
    >>> a
    array([(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
           (0, 0), (0, 0)], 
          dtype=[('key', '<i4'), ('value', '<i4')])
    >>> a[3] = (12, 5429)
    >>> a
    array([(0, 0), (0, 0), (0, 0), (12, 5429), (0, 0), (0, 0), (0, 0), (0, 0),
           (0, 0), (0, 0)], 
          dtype=[('key', '<i4'), ('value', '<i4')])
    >>> a[3]['key']
    12
    

    Here you can access the keys and values separately:

    >>> a['key']
    array([ 0,  0,  0, 12,  0,  0,  0,  0,  0,  0], dtype=int32)
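Combining the structured array with a binary search gives a dict-like lookup. The following is a sketch under assumptions, not a definitive implementation: the helper `lookup` and the sample rows are hypothetical, keys are assumed unique, and the table is sorted once up front.

```python
import numpy as np

# Structured array with named 'key' and 'value' fields, as above.
table = np.array([(12, 5429), (3, 77), (40, 8)],
                 dtype=[('key', '<i4'), ('value', '<i4')])
table.sort(order='key')  # sort the rows in place by key


def lookup(table, key):
    """Return the value stored for key, or raise KeyError."""
    keys = table['key']             # field access returns a view
    i = np.searchsorted(keys, key)  # binary search on the sorted keys
    if i < len(keys) and keys[i] == key:
        return int(table['value'][i])
    raise KeyError(key)


print(lookup(table, 12))  # 5429
```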
    
  • A Judy-array-based solution seems to be the option I should look into. I'm still looking for a good implementation that can be used from Python. Will update later.

    Update:

    I'm finally experimenting with a Judy array wrapper at http://code.google.com/p/py-judy/ . There seems to be no documentation there, so I found its methods simply by calling dir(...) on the package and its objects; it works, however.

    In the same experiment it eats ~986MB, about 1/3 of a standard dict, using judy.JudyIntObjectMap. The package also provides JudyIntSet, which in some special scenarios saves much more memory, since unlike JudyIntObjectMap it does not need to reference a real Python object as the value.

    (As the further tests below show, the JudyArray itself uses only several MB to tens of MB; most of the ~986MB is actually used by the value objects in Python's memory space.)
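A rough back-of-the-envelope check of why the value objects dominate, written for Python 3. The per-object size is CPython- and platform-specific, so the printed numbers are only indicative and will not match the Python 2 figures quoted here exactly.

```python
import sys

n = 30_000_000                # number of distinct int values stored
per_int = sys.getsizeof(n)    # bytes for one int object on this interpreter
estimate_mib = n * per_int / 2**20

# Compare with 4 bytes per value in a C-level array of 32-bit ints.
array_mib = n * 4 / 2**20

print(per_int, round(estimate_mib), round(array_mib))
```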

    Here's some code, in case it helps:

    >>> import judy
    >>> dir(judy)
    ['JudyIntObjectMap', 'JudyIntSet', '__doc__', '__file__', '__name__', '__package__']
    >>> a=judy.JudyIntObjectMap()
    >>> dir(a)
    ['__class__', '__contains__', '__delattr__', '__delitem__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__value_sizeof__', 'by_index', 'clear', 'get', 'iteritems', 'iterkeys', 'itervalues', 'pop']
    >>> a[100]=1
    >>> a[100]="str"
    >>> a["str"]="str"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'non-integer keys not supported'
    >>> for i in xrange(30000000):
    ...     a[i]=i+30000000   #finally eats ~986MB memory
    ... 
    

    Update:

    OK, I also tested a JudyIntSet of 30M ints.

    >>> a=judy.JudyIntSet()
    >>> a.add(1111111111111111111111111)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: we only support integers in the range [0, 2**64-1]
    

    In total it uses only 5.7MB to store the 30M sequential ints [0,30000000), which may be due to JudyArray's automatic compression. The 709MB figure was because I used range(...) instead of the more appropriate xrange(...) to generate the data.

    So the size of the core JudyArray holding 30M ints is simply negligible.

    If anyone knows of a more complete Judy array wrapper, please let me know, since this wrapper only wraps JudyIntObjectMap and JudyIntSet. For an int-to-int dict, JudyIntObjectMap still requires a real Python object per value. If we only do counter_add and set on the values, it would be a good idea to store the integer values in C space rather than as Python objects. I hope someone is interested in creating or pointing out such a wrapper : )
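To illustrate what "values in C space" could look like without any Judy wrapper, here is a pure-Python sketch that swaps in a different technique: an open-addressing hash table over two `array('q')` buffers. The class name `IntIntCounter` and its methods are hypothetical; it assumes non-negative keys, values that fit in signed 64-bit integers, a capacity fixed up front, and no deletion. It will be much slower than a C implementation.

```python
from array import array


class IntIntCounter:
    """Fixed-capacity int -> int map using open addressing.

    Keys and values live in array('q') buffers (8 bytes each), so no
    Python int object is kept alive per entry.
    """

    _EMPTY = -1  # sentinel for a free slot; keys must be non-negative

    def __init__(self, capacity):
        # Power-of-two table at least twice the capacity, so there is
        # always a free slot and probe chains stay short.
        size = 8
        while size < 2 * capacity:
            size *= 2
        self._mask = size - 1
        self._keys = array('q', [self._EMPTY]) * size
        self._values = array('q', [0]) * size
        self._capacity = capacity
        self._len = 0

    def _slot(self, key, insert=False):
        # Linear probing, starting from a simple multiplicative hash.
        if key < 0:
            raise ValueError("only non-negative keys are supported")
        i = (key * 0x9E3779B97F4A7C15 >> 32) & self._mask
        while self._keys[i] != self._EMPTY and self._keys[i] != key:
            i = (i + 1) & self._mask
        if insert and self._keys[i] == self._EMPTY:
            if self._len >= self._capacity:
                raise OverflowError("map is full")
            self._keys[i] = key
            self._values[i] = 0
            self._len += 1
        return i

    def __setitem__(self, key, value):
        self._values[self._slot(key, insert=True)] = value

    def __getitem__(self, key):
        i = self._slot(key)
        if self._keys[i] == self._EMPTY:
            raise KeyError(key)
        return self._values[i]

    def add(self, key, delta=1):
        """counter_add: increment the value for key, starting from 0."""
        self._values[self._slot(key, insert=True)] += delta

    def __len__(self):
        return self._len


c = IntIntCounter(capacity=1000)
for k in [7, 7, 42, 7]:
    c.add(k)
c[5] = 99
print(c[7], c[42], c[5], len(c))  # 3 1 99 3
```

Each entry costs 16 bytes per table slot regardless of the value stored, at the price of a fixed capacity chosen in advance.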
