I'm puzzled by this behaviour of memory allocation of sets:
>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).union(set(range(1000))).__sizeof__()
65736
In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.
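A rough Python paraphrase of that dispatch may make it concrete. This is a sketch of the idea only, with a function name of my own choosing and simplified type checks; the real logic is C code in Objects/setobject.c:
def update_internal_sketch(target, other):
    # Illustrative paraphrase of set_update_internal()'s type dispatch;
    # this is NOT the actual CPython code, just the shape of it.
    if isinstance(other, (set, frozenset)):
        # Set path: CPython resizes the table once, up front, sized on
        # the assumption that few keys overlap, then merges entries.
        target |= other
    elif isinstance(other, dict):
        # Dict path: iterate over the keys directly.
        target.update(other.keys())
    else:
        # Generic iterable path: add one key at a time, letting the
        # table grow incrementally as needed.
        for key in other:
            target.add(key)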
The implementation that's used when the argument is a set makes the following assumption, documented in the code:
/* Do one big resize at the start, rather than
* incrementally resizing as we insert new keys. Expect
* that there will be no (or few) overlapping keys.
*/
Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.
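To put numbers on it, here is a simplified model, assuming 16-byte table slots and the roughly two-thirds load factor CPython sets maintain (the actual resizing code differs in detail): 1000 unique keys fit in a 2048-slot table, but pre-sizing for 2000 expected keys lands on a 4096-slot table, roughly doubling the footprint even though no new keys ever get added.
def table_slots(n):
    # Smallest power-of-two table that keeps n keys under a two-thirds
    # load factor -- a simplified model, not CPython's exact resize code.
    size = 8
    while n * 3 >= size * 2:
        size *= 2
    return size

print(table_slots(1000))  # 2048 slots: enough for the 1000 unique keys
print(table_slots(2000))  # 4096 slots: what the no-overlap pre-resize reserves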
I am not sure I would call this a bug, though. The implementer of set chose what looks to me like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.
The upside of the tradeoff is that in many cases the pre-allocation results in better performance:
In [20]: rhs = list(range(1000))
In [21]: %timeit set().union(rhs)
10000 loops, best of 3: 30 us per loop
In [22]: rhs = set(range(1000))
In [23]: %timeit set().union(rhs)
100000 loops, best of 3: 14 us per loop
Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.
If the overallocation is a deal-breaker, there are a number of ways to work around it, some of which you've already discovered.
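For instance, one possibility (offered as a sketch, not necessarily one of the workarounds you found): hand union() an iterator rather than the set itself, which routes the merge through the generic-iterable path and so skips the big up-front resize.
a = set(range(1000))
b = set(range(1000))

tight = a.union(iter(b))   # iterator argument: keys are inserted one at a
                           # time, so the table grows only as far as needed
print(tight.__sizeof__())  # comparable to set(range(1000)).__sizeof__()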