Capture all consecutive all-caps words with regex in Python?

I am trying to match all consecutive all caps words/phrases using regex in Python. Given the following:

    text = "The following words are ALL CAPS. The follow


                      
4 answers

生来不讨喜, 2021-02-10 20:48

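Keeping your regex, you can strip() each match and filter out the empty results:

import re

string = "The following words are ALL CAPS. The following word is in CAPS."
# filter(None, ...) drops the whitespace-only matches; list() is needed on Python 3
result = list(filter(None, [x.strip() for x in re.findall(r"\b[A-Z\s]+\b", string)]))
# ['ALL CAPS', 'CAPS']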
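
To see why the strip() and filter step is needed, here is a small sketch for Python 3 that prints the raw matches next to the cleaned result. The helper name `caps_phrases` and the constant `PATTERN` are illustrative assumptions, not part of the answer.

```python
import re

# Pattern taken from the answer above; the names here are illustrative assumptions.
PATTERN = re.compile(r"\b[A-Z\s]+\b")

def caps_phrases(text):
    # Raw matches keep their surrounding whitespace, and the whitespace-only
    # gaps between lowercase words also match, so strip and drop the empties.
    return [m.strip() for m in PATTERN.findall(text) if m.strip()]

string = "The following words are ALL CAPS. The following word is in CAPS."
raw = PATTERN.findall(string)
print(raw)                   # includes entries such as ' ' and ' ALL CAPS'
print(caps_phrases(string))  # ['ALL CAPS', 'CAPS']
```

The character class contains \s, so a single space between two lowercase words satisfies the pattern on its own. That is where the empty strings come from after stripping.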

                                                                        
                                                        
            
            
              
                
一整个雨季, 2021-02-10 20:54

Your regex relies on an explicit condition (a space after the letters).

import re
matches = re.findall(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])", text)

This captures runs of A-Z, optionally joined by a single whitespace character, whose last character is not a lowercase letter, a digit, or a non-word character.
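
A quick check of this pattern, as a sketch for Python 3. The pattern is copied from the answer without its outer capture group (findall then returns the whole match); the extra sample strings are assumptions added for illustration. It suggests the pattern handles the sample sentence but has limits on longer runs and very short words.

```python
import re

# Pattern copied from the answer; the sample strings below are illustrative assumptions.
PATTERN = re.compile(r"[A-Z]+\s?[A-Z]+[^a-z0-9\W]")

sample = "The following words are ALL CAPS. The following word is in CAPS."
print(PATTERN.findall(sample))          # ['ALL CAPS', 'CAPS']

# \s? allows only one gap, so a three-word run is split in two.
print(PATTERN.findall("NASA AND ESA"))  # ['NASA AND', 'ESA']

# The pattern needs at least three characters, so two-letter words are skipped.
print(PATTERN.findall("Made in UK"))    # []
```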
                                                                        
                                                        
            
            
              
                
萌比男神i, 2021-02-10 21:01

This one does the job:

import re
text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
matches = re.findall(r"(\b(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)", text)
print(matches)


Output:

['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']


Explanation:

(           : start group 1
  \b        : word boundary
  (?:       : start non-capturing group
    [A-Z]+  : 1 or more capitals
    [a-z]?  : 0 or 1 lowercase letter
    [A-Z]*  : 0 or more capitals
   |        : OR
    [A-Z]*  : 0 or more capitals
    [a-z]?  : 0 or 1 lowercase letter
    [A-Z]+  : 1 or more capitals
  )         : end group
  \b        : word boundary
  (?:       : start non-capturing group
    \s+     : 1 or more whitespace characters
    (?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+) : same as above
    \b      : word boundary
  )*        : repeat the non-capturing group 0 or more times
)           : end group 1
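
The breakdown above can also live inside the pattern itself. This is a minimal sketch, assuming Python 3.6+ (for f-strings) and using re.VERBOSE so that whitespace and # comments in the pattern are ignored. The names `WORD` and `PATTERN` are illustrative, and the outer capture group is dropped because findall then returns the whole match.

```python
import re

# One "word": capitals with at most one lowercase letter allowed (from the answer).
WORD = r"(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)"

PATTERN = re.compile(
    rf"""
    \b{WORD}\b          # first word, delimited by word boundaries
    (?:\s+{WORD}\b)*    # any further words, each preceded by whitespace
    """,
    re.VERBOSE,
)

text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
matches = PATTERN.findall(text)
print(matches)  # ['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']
```

Defining the word sub-pattern once avoids repeating it by hand, which is where the original one-liner is easiest to get wrong.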

                                                                        
                                                        
            
            
              
                
感情败类, 2021-02-10 21:05

Assuming you want to start and end on a letter, and only include letters and whitespace:

\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b

The |[A-Z] alternative captures single-letter words such as I or A.
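
A minimal Python 3 sketch of how this pattern might be used. The helper name `find_caps_phrases`, the constant `CAPS_RUN`, and the second sample string are illustrative assumptions; the group is made non-capturing, which does not change what findall returns here.

```python
import re

# Non-capturing variant of the pattern above; the names are illustrative assumptions.
CAPS_RUN = re.compile(r"\b(?:[A-Z][A-Z\s]*[A-Z]|[A-Z])\b")

def find_caps_phrases(text):
    # findall returns the full match because the pattern has no capturing group.
    return CAPS_RUN.findall(text)

print(find_caps_phrases("The following words are ALL CAPS. The following word is in CAPS."))
# ['ALL CAPS', 'CAPS']
print(find_caps_phrases("A cat and I"))
# ['A', 'I']
```

Because every match must start and end on a capital letter, no strip() or filtering step is needed afterwards.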
                                                                        
                                                        
            
            
              
                