Why does union consume more memory if the argument is a set?

不知归路 2021-01-02 00:52

I'm puzzled by this behaviour of memory allocation of sets:

>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).unio         


        
1 answer
  • 2021-01-02 01:39

    In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.

    The implementation that's used when the argument is a set makes the following assumption documented in the code:

    /* Do one big resize at the start, rather than
     * incrementally resizing as we insert new keys.  Expect
     * that there will be no (or few) overlapping keys.
     */
    

    Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.
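
    The effect can be reproduced with a minimal sketch like the one below. It assumes CPython (sys.getsizeof reports the size of the set's hash table there), and the variable names are made up for this illustration. The exact byte counts depend on the interpreter version and platform, so none are shown.

```python
import sys

# Hypothetical names, chosen only for this illustration.
base = set(range(1000))

# A list argument goes through the generic iterable path: keys are added one
# at a time, and since every key already exists the table never has to grow.
via_list = base.union(list(range(1000)))

# A set argument goes through the set-specific path, which resizes once up
# front on the expectation that few keys overlap.
via_set = base.union(set(range(1000)))

print("base:       ", sys.getsizeof(base))
print("union(list):", sys.getsizeof(via_list))
print("union(set): ", sys.getsizeof(via_set))
```

    Both unions contain the same 1000 elements. Only the amount of memory reserved for them differs.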

    I am not sure I would call this a bug though. The implementer of set chose what to me looks like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.

    The upside of the tradeoff is that in many cases the pre-allocation results in better performance:

    In [20]: rhs = list(range(1000))
    
    In [21]: %timeit set().union(rhs)
    10000 loops, best of 3: 30 us per loop
    
    In [22]: rhs = set(range(1000))
    
    In [23]: %timeit set().union(rhs)
    100000 loops, best of 3: 14 us per loop
    

    Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.
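
    Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper best_time and its repeat/number defaults are assumptions made for this example, and the absolute timings will differ from the figures above depending on machine and Python version.

```python
import timeit

rhs_list = list(range(1000))
rhs_set = set(rhs_list)

def best_time(rhs, repeat=5, number=2000):
    """Return the best per-call time in seconds for set().union(rhs)."""
    timer = timeit.Timer(lambda: set().union(rhs))
    # Take the minimum over several runs to reduce noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number

list_time = best_time(rhs_list)
set_time = best_time(rhs_set)

print("list argument: %.2f us per call" % (list_time * 1e6))
print("set argument:  %.2f us per call" % (set_time * 1e6))
```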

    If the overallocation is a deal-breaker, there are a number of ways to work around it, some of which you've already discovered.
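
    Two possible workarounds are sketched below. They are not necessarily the ones from the question, and both rely on CPython implementation details rather than documented guarantees. The first hides the set behind a plain iterator so that the generic, incremental path is taken. The second takes the union normally and then copies the result, which sizes the new table for the elements actually present.

```python
import sys

a = set(range(1000))
b = set(range(1000))

# Baseline: a set argument triggers the up-front resize.
overallocated = a.union(b)

# Workaround 1 (assumption: CPython treats a bare iterator as a generic
# iterable and adds keys one by one, without the big initial resize).
compact_iter = a.union(iter(b))

# Workaround 2: copy the result, so the new set is sized from its real
# element count instead of the pessimistic estimate.
compact_copy = set(overallocated)

for name, s in [("union(set)", overallocated),
                ("union(iter(set))", compact_iter),
                ("set(union(set))", compact_copy)]:
    print(name, len(s), sys.getsizeof(s))
```

    The first workaround gives up the speed advantage of the set-specific path, and the second pays for an extra copy, so neither is free.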
