I'm puzzled by this behaviour of memory allocation of sets:
>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).union(set(range(1000))).__sizeof__()
65736
In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.
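A rough Python paraphrase of that dispatch may make it concrete. This is a sketch of the idea only, with a function name of my own choosing and simplified type checks; the real logic is C code in Objects/setobject.c:
def update_internal_sketch(target, other):
    # Illustrative paraphrase of set_update_internal()'s type dispatch;
    # this is NOT the actual CPython code, just the shape of it.
    if isinstance(other, (set, frozenset)):
        # Set path: CPython resizes the table once, up front, sized on
        # the assumption that few keys overlap, then merges entries.
        target |= other
    elif isinstance(other, dict):
        # Dict path: iterate over the keys directly.
        target.update(other.keys())
    else:
        # Generic iterable path: add one key at a time, letting the
        # table grow incrementally as needed.
        for key in other:
            target.add(key)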
The implementation that's used when the argument is a set makes the following assumption, documented in the code:
/* Do one big resize at the start, rather than
* incrementally resizing as we insert new keys. Expect
* that there will be no (or few) overlapping keys.
*/
Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.
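To put numbers on it, here is a simplified model, assuming 16-byte table slots and the roughly two-thirds load factor CPython sets maintain (the actual resizing code differs in detail): 1000 unique keys fit in a 2048-slot table, but pre-sizing for 2000 expected keys lands on a 4096-slot table, roughly doubling the footprint even though no new keys ever get added.
def table_slots(n):
    # Smallest power-of-two table that keeps n keys under a two-thirds
    # load factor -- a simplified model, not CPython's exact resize code.
    size = 8
    while n * 3 >= size * 2:
        size *= 2
    return size

print(table_slots(1000))  # 2048 slots: enough for the 1000 unique keys
print(table_slots(2000))  # 4096 slots: what the no-overlap pre-resize reserves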
I am not sure I would call this a bug, though. The implementer of set chose what looks to me like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.
The upside of the tradeoff is that in many cases the pre-allocation results in better performance:
In [20]: rhs = list(range(1000))
In [21]: %timeit set().union(rhs)
10000 loops, best of 3: 30 us per loop
In [22]: rhs = set(range(1000))
In [23]: %timeit set().union(rhs)
100000 loops, best of 3: 14 us per loop
Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.
If the overallocation is a deal-breaker, there are a number of ways to work around it, some of which you've already discovered.
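For instance, one possibility (offered as a sketch, not necessarily one of the workarounds you found): hand union() an iterator rather than the set itself, which routes the merge through the generic-iterable path and so skips the big up-front resize.
a = set(range(1000))
b = set(range(1000))

tight = a.union(iter(b))   # iterator argument: keys are inserted one at a
                           # time, so the table grows only as far as needed
print(tight.__sizeof__())  # comparable to set(range(1000)).__sizeof__()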