Filter DataFrame rows based on length of column values

I have a pandas dataframe as follows:

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])

df
              A  B
0             1  2
1           NaN  1
2  test string1  5


                      
4 Answers

耶瑟儿～, 2021-01-17 19:20
                                                                       
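I had to cast each value to a string for Diego's answer to work, because len() raises a TypeError on the non-string values in column A (the integer 1 and NaN):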

df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]
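A minimal, self-contained sketch of the cast-to-string approach, assuming the frame from the question. The name max_len is only an illustrative assumption. Note that str() turns NaN into the three-character string "nan", so missing values pass the check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ["test string1", 5]], columns=["A", "B"])

max_len = 10  # hypothetical threshold name, not from the original answer

# str() converts 1 to "1" and NaN to "nan", so len() never raises here.
mask = df["A"].apply(lambda x: len(str(x)) <= max_len)
filtered = df[mask]
print(filtered)
```

One caveat with this design: numbers are measured by their printed form, so a float such as 1.0 counts as three characters.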

佛祖请我去吃肉, 2021-01-17 19:27
                                                                       
In [42]: df
Out[42]:
              A  B                         C          D
0             1  2                         2 2017-01-01
1           NaN  1                       NaN 2017-01-02
2  test string1  5  test string1test string1 2017-01-03

In [43]: df.dtypes
Out[43]:
A            object
B             int64
C            object
D    datetime64[ns]
dtype: object

In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[44]:
     A  B    C          D
0    1  2    2 2017-01-01
1  NaN  1  NaN 2017-01-02


Explanation:

df.select_dtypes(['object']) selects only the columns of object dtype, which is where pandas stores Python strings by default:

In [45]: df.select_dtypes(['object'])
Out[45]:
              A                         C
0             1                         2
1           NaN                       NaN
2  test string1  test string1test string1

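Next, .str.len().gt(10) flags every value longer than 10 characters; non-string values yield NaN lengths, which compare as False:

In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))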
Out[46]:
       A      C
0  False  False
1  False  False
2   True   True


Now we can aggregate across columns with any(axis=1), which marks a row as True if any of its object columns is too long:

In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0    False
1    False
2     True
dtype: bool


Finally, we negate the mask with ~ and keep only the rows where the aggregated value is False:

In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
     A  B    C          D
0    1  2    2 2017-01-01
1  NaN  1  NaN 2017-01-02
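The steps above could be wrapped in a small helper. This is only a sketch: the function name drop_long_rows and its max_len parameter are assumptions for illustration, and the sample frame is a reduced version of the one shown in the transcript.

```python
import numpy as np
import pandas as pd


def drop_long_rows(frame: pd.DataFrame, max_len: int = 10) -> pd.DataFrame:
    """Drop rows where any object-dtype column holds a string longer than max_len."""
    text_cols = frame.select_dtypes(include=["object"])
    # .str.len() yields NaN for non-string values, and NaN > max_len is False.
    too_long = text_cols.apply(lambda col: col.str.len().gt(max_len)).any(axis=1)
    return frame.loc[~too_long]


df = pd.DataFrame(
    {
        "A": [1, np.nan, "test string1"],
        "B": [2, 1, 5],
        "D": pd.date_range("2017-01-01", periods=3),
    }
)
result = drop_long_rows(df)
print(result)
```

Because only object columns are inspected, the integer and datetime columns never influence the mask.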

既然无缘, 2021-01-17 19:29
                                                                       
To filter on column A only:

In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
     A  B
0    1  2
1  NaN  1


To filter on all columns (note that from pandas 2.1 onward DataFrame.applymap is deprecated in favour of DataFrame.map):

In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
     A  B
0    1  2
1  NaN  1
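Both variants can be reproduced in a script as follows. This is a hedged sketch using the frame from the question; the hasattr check is an assumption added here so that the same code runs on pandas versions with and without DataFrame.map.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ["test string1", 5]], columns=["A", "B"])

# Column A only: non-strings give NaN lengths, and NaN > 10 evaluates to False.
keep_a = ~(df["A"].str.len() > 10)

# All columns: prefer DataFrame.map where it exists, else fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
keep_all = ~elementwise(lambda x: len(str(x)) > 10).any(axis=1)

by_a = df[keep_a]
by_all = df[keep_all]
print(by_a)
print(by_all)
```

The comparison is written as "not longer than 10" rather than "at most 10" on purpose: it keeps rows with NaN lengths, which a direct <= test would drop.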

半阙折子戏, 2021-01-17 19:31
                                                                       
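Use the Series apply method to keep only the rows whose value in A is at most 10 characters long. This assumes every value in A is a string, since len() raises a TypeError on integers and NaN: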

df = df[df['A'].apply(lambda x: len(x) <= 10)]
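To illustrate that limitation on the question's mixed-type column, here is a small sketch. The helper short_enough is a hypothetical name introduced for illustration; it treats non-string values as passing instead of casting them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ["test string1", 5]], columns=["A", "B"])

try:
    df[df["A"].apply(lambda x: len(x) <= 10)]
    raised = False
except TypeError:
    # len() is undefined for the int 1 and the float NaN in column A.
    raised = True


def short_enough(value, max_len=10):
    """Let non-string values pass; only measure real strings."""
    return not isinstance(value, str) or len(value) <= max_len


kept = df[df["A"].apply(short_enough)]
print(raised)
print(kept)
```

Compared with casting through str(), this variant never measures the printed form of numbers, which may or may not be what you want.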