Applying regex to a pandas DataFrame

前端未结

I'm having trouble applying a regex function to a column in a pandas dataframe. Here is the head of my dataframe:

               Name   Season          School


                      
5 Answers

                  太阳男子        
                
              
                            
                2020-12-05 13:51
              
            
            
                                                                       
The problem can be solved with the following code:

import re

def split_it(year):
    # re.search returns a match object, or None when no four-digit run is found
    match = re.search(r'(\d{4})', year)
    if match:
        return match.group(1)

df['Season2'] = df['Season'].apply(split_it)


You were facing this problem because some rows didn't have a year in the string.
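
For reference, here is a minimal sketch of how `re.findall` and `re.search` differ in what they return, which matters when some strings contain no year. The sample values and the helper name `first_year` are hypothetical, not taken from the question's data.

```python
import re

def first_year(text):
    """Return the first four-digit run in text, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group() if match else None

# Hypothetical sample values; the last one has no year in it.
samples = ["1982-83", "1983-84", "Career"]

# findall always returns a list (possibly empty), never a match object.
found_all = [re.findall(r"\d{4}", s) for s in samples]

# search returns a match object or None, so the None case must be handled.
found_first = [first_year(s) for s in samples]

print(found_all)
print(found_first)
```

Because `findall` returns a plain list, calling `.group()` on its result does not work; `.group()` is only available on the match object returned by `re.search` or `re.match`.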
                                                                        
                                                        
            
            
              
                
                  孤独总比滥情好        
                
              
                            
                2020-12-05 14:04
              
            
            
                                                                       
When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't.

You could use either

df['Season2'] = df['Season'].apply(split_it)


or

df['Season2'] = df['Season'].apply(lambda x: split_it(x))


but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.)  Your function will return a list, though:

>>> df["Season"].apply(split_it)
74     [1982]
84     [1982]
176    [1982]
177    [1983]
243    [1982]
Name: Season, dtype: object


although you could easily change that.  FWIW, I'd use vectorized string operations and do something like

>>> df["Season"].str[:4].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64


or

>>> df["Season"].str.split("-").str[0].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64
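
One caveat with the slice-and-convert approach: `astype(int)` raises if any row does not start with a year. Below is a small sketch of one way around that, using `pd.to_numeric` with `errors="coerce"`. The Series here is made-up sample data, not the asker's dataframe.

```python
import pandas as pd

# Hypothetical data: the last value has no leading year.
season = pd.Series(["1982-83", "1983-84", "Career"], name="Season")

# Slice the first four characters, then convert; values that cannot be
# parsed as numbers become NaN instead of raising an error.
year = pd.to_numeric(season.str[:4], errors="coerce")
print(year)
```

The result is a float column because NaN cannot be stored in a plain int64 column; rows with a missing year can then be dropped or filled before converting to integers.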

                                                                        
                                                        
            
            
              
                
                  小鲜肉        
                
              
                            
                2020-12-05 14:04
              
            
            
                                                                       
You can simply use str.extract:

df['Season2'] = df['Season'].str.extract(r'(\d{4})-\d{2}')


Here the pattern locates \d{4}-\d{2} (for example 1982-83) but only extracts the captured group between the parentheses, \d{4} (for example 1982).
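
A self-contained sketch of the same idea, assuming a hypothetical Series in which one row has no season string. Passing `expand=False` makes `str.extract` return a Series rather than a one-column DataFrame when the pattern has a single group.

```python
import pandas as pd

# Hypothetical data: the last row does not match the pattern.
season = pd.Series(["1982-83", "1983-84", "Career"], name="Season")

# Only the text inside the parentheses is returned; rows without a match
# come back as missing values instead of raising.
start_year = season.str.extract(r"(\d{4})-\d{2}", expand=False)
print(start_year)
```

The extracted values are still strings, so a numeric conversion is needed afterwards if integers are wanted.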
                                                                        
                                                        
            
            
              
                
                  滥情空心        
                
              
                            
                2020-12-05 14:07
              
            
            
                                                                       
I had the exact same issue. Thanks for the answers @DSM. 
FYI @itjcms, you can improve the function by removing the repetition of the '\d\d\d\d'. 

def split_it(year):
    return re.findall(r'(\d\d\d\d)', year)


Becomes:

def split_it(year):
    return re.findall(r'(\d{4})', year)
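
A quick sanity check that the two patterns behave the same, on a few made-up strings (not the original data). Raw strings (`r'...'`) are used so that `\d` is not treated as a string escape.

```python
import re

# Hypothetical test strings.
samples = ["1982-83", "Season 1999-00", "Career", "20012002"]

long_form = [re.findall(r"(\d\d\d\d)", s) for s in samples]
short_form = [re.findall(r"(\d{4})", s) for s in samples]

# Both patterns describe exactly four consecutive digits.
print(long_form == short_form)
```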

                                                                        
                                                        
            
            
              
                
                  小鲜肉        
                
              
                            
                2020-12-05 14:08
              
            
            
                                                                       
You can use pandas' native string functions to do this too.
Check this page for the pandas functions that accept regular expressions. For your case, you can do

df["Season"].str.extract(r'(\d{4})')
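
As a rough illustration, here are a few of the `Series.str` methods that take a regular expression, run on hypothetical sample data. This is a sketch of the general idea, not an exhaustive list.

```python
import pandas as pd

# Hypothetical data: one season string and one value with no year.
season = pd.Series(["1982-83", "Career"], name="Season")

# str.contains: boolean mask of rows where the pattern is found.
has_year = season.str.contains(r"\d{4}")

# str.findall: list of every match per row (empty list when none).
all_years = season.str.findall(r"\d{4}")

# str.replace with regex=True: strip the trailing "-NN" part.
trimmed = season.str.replace(r"-\d{2}$", "", regex=True)

print(has_year.tolist())
print(all_years.tolist())
print(trimmed.tolist())
```

The boolean mask from `str.contains` can be used to filter out rows with no year before applying any conversion.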

                                                                        
                                                        
            
            
              
                