How to make word boundary \b not match on dashes


I simplified my code to the specific problem I am having.

import re
pattern = re.compile(r'\bword\b')
result = pattern.sub(lambda x: "match", "-word-


                      


      
      
        
3 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  忘掉有多难        
                
              
                            
                2021-01-05 04:35
              
            
            
                                                                       
Instead of word boundaries, you could match the character before and after the word with the groups (\s|^) and (\s|$).

Breakdown: \s matches any whitespace character, which seems to be what you want, since you are excluding the dashes. The ^ and $ alternatives ensure that the word is also matched when it is the first or last thing in the string (i.e. there is no character before or after it).

Your code would become something like this:

pattern = re.compile(r'(\s|^)(word)(\s|$)')
result = pattern.sub(r"\1match\3", "-word- word")


Because this solution spells out the delimiters explicitly (here \s), they are easy to replace or extend. For example, if you wanted your words to be delimited by spaces or commas, your pattern would become something like this: r'(,|\s|^)(word)(,|\s|$)'.
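As a rough sketch of how the delimiter set could be made configurable, the snippet below wraps the pattern in a helper. The name replace_delimited and its parameters are made up for illustration and are not part of the answer above. It assumes the delimiters are given as the contents of a character class.

```python
import re

# Hypothetical helper (not from the original answer): replaces `word` only
# when it is surrounded by a delimiter or by the start/end of the string.
def replace_delimited(text, word, replacement, delimiters=r"\s"):
    # `delimiters` is placed inside a character class, e.g. r"\s" or r"\s,".
    pattern = re.compile(
        r"(^|[{d}])({w})([{d}]|$)".format(d=delimiters, w=re.escape(word))
    )
    # Put the captured delimiters back around the replacement text.
    return pattern.sub(lambda m: m.group(1) + replacement + m.group(3), text)

print(replace_delimited("-word- word", "word", "match"))
# -word- match
print(replace_delimited("a,word,b -word-", "word", "match", delimiters=r"\s,"))
# a,match,b -word-
print(replace_delimited("word word", "word", "match"))
# match word
```

The last call shows a limitation of this approach: the delimiter is consumed by the match, so when two occurrences share a single delimiter, only the first one is replaced.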
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  礼貌的吻别        
                
              
                            
                2021-01-05 04:38
              
            
            
                                                                       
\b matches at the boundary between a word character (roughly [a-zA-Z0-9_]) and any other character, so a dash counts as a boundary just as a space does. Surround word with negative lookarounds to ensure that no non-whitespace character comes directly before or after it:

re.compile(r'(?<!\S)word(?!\S)')
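A short sketch of how this pattern behaves on the question's input and on a couple of other strings (the extra sample strings are chosen here for illustration):

```python
import re

# Lookarounds are zero-width: they check the neighbouring characters
# without consuming them.
pattern = re.compile(r"(?<!\S)word(?!\S)")

dashed = pattern.sub("match", "-word- word")
adjacent = pattern.sub("match", "word word")
punctuated = pattern.sub("match", "word, word.")

print(dashed)      # -word- match
print(adjacent)    # match match
print(punctuated)  # word, word.
```

Because nothing around the word is consumed, adjacent occurrences are both replaced. The flip side is that any non-whitespace neighbour blocks the match, including punctuation such as a comma or a full stop.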

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  忘了有多久        
                
              
                            
                2021-01-05 05:00
              
            
            
                                                                       
What you need is a negative lookbehind.

pattern = re.compile(r'(?<!-)\bword\b')
result = pattern.sub(lambda x: "match", "-word- word")


To cite the documentation:


  (?<!...)
      Matches if the current position in the string is not preceded by a match for ....


So this will only match if the word boundary \b is not preceded by a minus sign (-).

If you also need to rule out a dash directly after the word, add a negative lookahead, (?!-). The complete regular expression then becomes: (?<!-)\bword(?!-)\b
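To compare the variants side by side, here is a small sketch that runs the plain \b pattern and both lookaround versions over a few sample strings (the variable names and the extra samples are illustrative only):

```python
import re

plain = re.compile(r"\bword\b")
no_dash_before = re.compile(r"(?<!-)\bword\b")
no_dash_either_side = re.compile(r"(?<!-)\bword(?!-)\b")

samples = ["-word- word", "word- word", "word."]

results = {
    text: (
        plain.sub("match", text),
        no_dash_before.sub("match", text),
        no_dash_either_side.sub("match", text),
    )
    for text in samples
}

for text, outcome in results.items():
    print(repr(text), outcome)
# '-word- word' ('-match- match', '-word- match', '-word- match')
# 'word- word' ('match- match', 'match- match', 'word- match')
# 'word.' ('match.', 'match.', 'match.')
```

Only the dash is excluded here, so the word is still matched next to other punctuation such as a full stop.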
                                                                        
                                                        
            
            
              
                