Comparison of Pandas lookup times

After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame I am left with a few questions.

Here is the set up...

import pan


                      
2 answers

旧时难觅i, 2021-02-07 02:46

The disparity in these %timeit results

In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop

In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop


also shows up in the pure NumPy equality comparisons:

In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop

In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop


Under the hood, pandas evaluates df1['letter'] == 'ben' with a Cython function
that loops through the values of the underlying NumPy array,
df1['letter'].values. It does essentially the same work as
df1['letter'].values == 'ben', but handles NaNs differently.

Moreover, simply iterating over the items of df1['letter'] in sequential order
is faster than doing the same for df2['letter']:

In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop

In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop


In each of these three pairs of %timeit tests, the df2 version is slower by
roughly the same amount (about 72 to 75 ms). I think that is because they all
share the same cause.

Since the letter column holds strings, the NumPy arrays df1['letter'].values and
df2['letter'].values have dtype object. Such arrays do not store the strings
themselves; they hold pointers to the memory locations of arbitrary Python
objects (in this case, strings).
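
To make the pointer idea concrete, here is a minimal sketch using plain NumPy. The sample strings are made up for illustration and are not the question's data. It checks that an object array hands back the very same Python objects it was built from, and that each slot is pointer-sized, whereas a float64 array stores the 8-byte values inline.

```python
import ctypes
import numpy as np

# Hypothetical sample strings (not the data from the question).
words = ['ben', 'al', 'carol']
arr = np.array(words, dtype=object)

# Each slot of an object array is one pointer wide.
pointer_size = ctypes.sizeof(ctypes.c_void_p)
print(arr.dtype, arr.itemsize == pointer_size)

# The array refers to the original str objects; it does not copy their contents.
same_object = all(a is w for a, w in zip(arr, words))
print(same_object)

# A float64 array, by contrast, stores the numbers themselves, 8 bytes each.
floats = np.array([0.0, 1.0, 2.0])
print(floats.dtype, floats.itemsize)
```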

Consider the memory locations of the strings stored in the DataFrames df1 and
df2. In CPython, id() returns the memory address of an object:

memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
                       'df2': list(map(id, df2['letter'])), })

               df1              df2
0  140226328244040  140226299303840
1  140226328243088  140226308389048
2  140226328243872  140226317328936
3  140226328243760  140226230086600
4  140226328243368  140226285885624


The strings in df1 (after the first dozen or so) tend to appear sequentially
in memory, while sorting causes the strings in df2 (taken in order) to be
scattered in memory:

In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]: 
         df1         df2
0        NaN         NaN
1     -952.0   9085208.0
2      784.0   8939888.0
3     -112.0 -87242336.0
4     -392.0  55799024.0
5     -392.0   5436736.0
6      952.0  22687184.0
7       56.0 -26436984.0
8     -448.0  24264592.0
9      -56.0  -4092072.0
10    -168.0 -10421232.0
11 -363584.0   5512088.0
12      56.0 -17433416.0
13      56.0  40042552.0
14      56.0 -18859440.0
15      56.0 -76535224.0
16      56.0  94092360.0
17      56.0  -4189368.0
18      56.0     73840.0
19      56.0  -5807616.0
20      56.0  -9211680.0
21      56.0  20571736.0
22      56.0 -27142288.0
23      56.0   5615112.0
24      56.0  -5616568.0
25      56.0   5743152.0
26      56.0 -73057432.0
27      56.0  -4988200.0
28      56.0  85630584.0
29      56.0  -4706136.0


Most of the strings in df1 are 56 bytes apart:

In [16]: diffs['df1'].value_counts()
Out[16]: 
 56.0           986109
 120.0           13671
-524168.0          215
-56.0                1
-12664712.0          1
 41136.0             1
-231731080.0         1
Name: df1, dtype: int64

In [20]: len(diffs['df1'].value_counts())
Out[20]: 7


In contrast, the strings in df2 are scattered all over the place:

In [17]: diffs['df2'].value_counts().head()
Out[17]: 
-56.0     46
 56.0     44
 168.0    39
-112.0    37
-392.0    35
Name: df2, dtype: int64

In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764


When these string objects are located sequentially in memory, their values
can be retrieved more quickly. This is why the equality comparisons performed by
df1['letter'].values == 'ben' run faster than those in
df2['letter'].values == 'ben': every comparison has to follow a pointer to a
string, and following pointers to neighboring addresses takes less time than
jumping around in memory.
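
One way to isolate this effect without pandas is sketched below. It is an assumption-laden illustration rather than the original benchmark: it uses made-up strings and a random shuffle (instead of a sort) to scatter the access order. Both arrays point at exactly the same string objects; only the order of the pointers differs. Timings vary by machine, so none are quoted.

```python
import random
import timeit
import numpy as np

n = 200_000  # assumed size

# Fresh string objects, created (and so allocated) one after another.
strings = [f"name{k % 5000:04d}" for k in range(n)]
in_order = np.array(strings, dtype=object)

# Same objects, visited in a scattered order.
shuffled_list = strings[:]
random.Random(0).shuffle(shuffled_list)
shuffled = np.array(shuffled_list, dtype=object)

target = 'name0042'
t_in_order = min(timeit.repeat(lambda: in_order == target, number=5, repeat=3))
t_shuffled = min(timeit.repeat(lambda: shuffled == target, number=5, repeat=3))

print(f"allocation order: {t_in_order:.4f} s")
print(f"shuffled order:   {t_shuffled:.4f} s")
```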

This memory accessing issue also explains why there is no disparity in the
%timeit results for the value column.

In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop

In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop


df1['value'] and df2['value'] are backed by NumPy arrays of dtype float64.
Unlike object arrays, these store the values themselves, packed together
contiguously in memory. Sorting df1 with df2 = df1.sort_values('letter')
reorders the values in df2['value'], but because they are copied into a new
NumPy array, they again sit sequentially in memory. So accessing the values in
df2['value'] is just as quick as accessing those in df1['value'].
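
A small sketch of that last point, again with made-up data since the question's setup is not shown: after sorting, the value column is still a float64 array whose layout can be inspected through its strides and flags.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # assumed size

# Hypothetical frame with one string column and one float column.
names = [f"name{k}" for k in rng.integers(0, 100, size=n).tolist()]
df1 = pd.DataFrame({'letter': pd.Series(names, dtype=object),
                    'value': rng.random(n)})
df2 = df1.sort_values('letter')

v1 = df1['value'].to_numpy()
v2 = df2['value'].to_numpy()

print(v1.dtype, v2.dtype)            # float64 float64
print(v2.strides, v2.flags['C_CONTIGUOUS'])
print(np.shares_memory(v1, v2))      # whether the sort reused df1's buffer
```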
                                                                        
                                                        
            
            
              
                

误落风尘, 2021-02-07 02:49

(1) pandas currently has no knowledge of the sortedness of a column.

If you want to take advantage of sorted data, you could use df2.letter.searchsorted. See @unutbu's answer for an explanation of what is actually causing the difference in time here.

(2) The hash table that sits underneath the index is lazily created, then cached.
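
As a rough illustration of the searchsorted suggestion in (1), the sketch below uses a tiny hypothetical frame (the names and values are invented) and a recent pandas, where searchsorted returns a scalar for a scalar argument. Two binary searches give the bounds of the run of matching rows in the sorted column, and the result is compared against the ordinary boolean-mask lookup.

```python
import pandas as pd

# Hypothetical data, not the frame from the question.
df = pd.DataFrame({
    'letter': pd.Series(['carol', 'ben', 'al', 'ben', 'dave'], dtype=object),
    'value': [0.0, 1.0, 2.0, 3.0, 4.0],
})

# searchsorted only makes sense on a sorted column.
df2 = df.sort_values('letter').reset_index(drop=True)

# Binary search for the first and one-past-last position of 'ben'.
lo = int(df2['letter'].searchsorted('ben', side='left'))
hi = int(df2['letter'].searchsorted('ben', side='right'))
by_search = df2.iloc[lo:hi]

# The usual full-scan lookup, for comparison.
by_mask = df2[df2['letter'] == 'ben']

print(by_search.equals(by_mask))
```

The binary search does O(log n) comparisons instead of scanning every row, but it is only valid while the column stays sorted; pandas itself does not track or enforce that.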
                                                                        
                                                        
            
            
              
                