Remove small words using Python


Is it possible to use a regex to remove small words from a text? For example, I have the following string (text):

anytext = " in the echo chamber from Ontario duo "


                      
2 answers

温柔的废话
2020-12-03 14:30

Certainly, it's not that hard either:

import re
shortword = re.compile(r'\W*\b\w{1,3}\b')


The expression above matches any word that is preceded by zero or more non-word characters (typically whitespace, or nothing at the start of the string), is 1 to 3 characters long, and ends on a word boundary.

>>> shortword.sub('', anytext)
' echo chamber from Ontario '


The \b boundary matches are important here: they ensure that you don't match just the first or last 3 characters of a longer word.
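
To illustrate why the boundaries matter, here is a small sketch comparing the pattern with and without \b. The sample string and the variable names are illustrative assumptions, not part of the original answer.

```python
import re

sample = "echo chamber"  # illustrative input; both words are longer than 3 characters

# Without \b, \w{1,3} matches inside longer words, chunk by chunk.
no_boundary = re.compile(r'\w{1,3}')
# With \b on both sides, only whole words of 1 to 3 characters can match.
with_boundary = re.compile(r'\b\w{1,3}\b')

chopped = no_boundary.sub('', sample)
intact = with_boundary.sub('', sample)

print(repr(chopped))  # ' '
print(repr(intact))   # 'echo chamber'
```

Without the boundaries every run of word characters is eaten in pieces, so only the space survives; with them, both long words are left alone.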

The \W* at the start removes both the word and the non-word characters preceding it, so that the remaining words keep their normal spacing. Note that punctuation is included in \W; use \s instead if you only want to remove preceding whitespace.
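
As a rough illustration of the \W versus \s distinction, the sketch below runs both variants on a sentence containing a comma. The sample sentence and the variable names are made up for this example.

```python
import re

sample = "Hello, to everyone"  # illustrative input with punctuation

with_nonword = re.compile(r'\W*\b\w{1,3}\b')  # pattern from the answer
with_space = re.compile(r'\s*\b\w{1,3}\b')    # whitespace-only variant

stripped_punct = with_nonword.sub('', sample)
kept_punct = with_space.sub('', sample)

print(stripped_punct)  # Hello everyone
print(kept_punct)      # Hello, everyone
```

With \W* the comma before "to" is swallowed together with the short word; with \s* only the space is removed and the comma stays.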

For what it's worth, this regular expression solution preserves extra whitespace between the rest of the words, while mgilson's version collapses multiple whitespace characters into one space. Not sure if that matters to you.
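
The whitespace difference can be seen in a short sketch like the following, which applies both approaches to a string with runs of several spaces. The sample string and variable names are assumptions chosen for illustration.

```python
import re

shortword = re.compile(r'\W*\b\w{1,3}\b')
sample = "echo   chamber  of   Ontario"  # illustrative; note the runs of spaces

# Regex version: removes "of" and the spaces before it, leaves other spacing alone.
regex_result = shortword.sub('', sample)
# split/join version: str.split() discards all whitespace runs, join adds single spaces.
split_result = ' '.join(word for word in sample.split() if len(word) > 3)

print(repr(regex_result))  # 'echo   chamber   Ontario'
print(repr(split_result))  # 'echo chamber Ontario'
```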

His list comprehension solution is also the faster of the two, though only by roughly 10% in this measurement:

>>> import timeit
>>> def re_remove(text): return shortword.sub('', text)
... 
>>> def lc_remove(text): return ' '.join(word for word in text.split() if len(word)>3)
... 
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import re_remove as remove')
7.0774190425872803
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import lc_remove as remove')
6.4250049591064453

                                                                        
                                                        
            
            
              
                
忘掉有多难
2020-12-03 14:35

I don't think you need a regex for this simple example anyway ...

' '.join(word for word in anytext.split() if len(word)>3)
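
If it helps, the one-liner can be wrapped in a small function. This is only a sketch: the name remove_short_words and the min_length parameter are hypothetical, introduced here for illustration. One caveat worth knowing is that str.split() only splits on whitespace, so attached punctuation counts toward a token's length.

```python
def remove_short_words(text, min_length=4):
    """Keep only whitespace-separated tokens with at least min_length characters."""
    return ' '.join(word for word in text.split() if len(word) >= min_length)

cleaned = remove_short_words(" in the echo chamber from Ontario duo ")
with_comma = remove_short_words("the duo, from Ontario")

print(cleaned)     # echo chamber from Ontario
print(with_comma)  # duo, from Ontario
```

In the second call "duo," survives because the trailing comma makes the token four characters long.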

                                                                        
                                                        
            
            
              
                