Fast pandas filtering

后端未结

关注

 3  670

I want to filter a pandas dataframe, if the name column entry has an item in a given list.

Here we have a DataFrame

x = DataFrame(
    [[\'sam\', 328], [


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  生来不讨喜        
                
              
                            
                2021-02-06 13:39
              
            
            
                                                                       
If your data repeats a lot of values, try using the 'categorical' data type for that column and then applying boolean filtering. Much more flexible than using indices and, at least in my case, much faster.
data = pd.read_csv('data.csv', dtype={'name':'category'})
data[(data.name=='sam')&(data.score>1)]

or
names=['sam','ruby']    
data[data.name.isin(names)]

For the ~15 million row, ~200k unique terms dataset I'm working with in pandas 1.2, %timeit results are:

boolean filter on object column: 608ms
.loc filter on same object column as index: 281ms
boolean filter on same object column as 'categorical' type: 16ms

From there, add the .sum() or whatever aggregation function you're looking for.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  感情败类        
                
              
                            
                2021-02-06 13:44
              
            
            
                                                                       
Try using isin (thanks to DSM for suggesting loc over ix here):

In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])

In [79]: names = ['sam', 'ruby']

In [80]: x['name'].isin(names)
Out[80]: 
0     True
1     True
2    False
Name: name, dtype: bool

In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541




CT Zhu suggests a faster alternative using np.in1d:

In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop

In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  挽巷        
                
              
                            
                2021-02-06 13:50
              
            
            
                                                                       
If I need to search on a field, I have noticed that it helps immensely if I change the index of the DataFrame to the search field. For one of my search and lookup requirements I got a performance improvement of around 500%.

So in your case the following could be used to search and filter by name.

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]], 
                 columns=['name', 'score'])
names = ['sam', 'ruby']

df_searchable = df.set_index('name')

df_searchable[df_searchable.index.isin(names)]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复