Split Strings into words with multiple word boundary delimiters

前端未结

关注

 30  2709

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task but I\'ve found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H


      
      
        
          30条回答        

        
                    
            
            
                         
                
              
              
                
                   不思量自难忘°
                                             
                
                
                (楼主)
            
              
              
                2020-11-21 06:06
              

            
            
                        
Pro-Tip: Use string.translate for the fastest string operations Python has.

Some proof...

First, the slow way (sorry pprzemek):

>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
... 
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552


Next, we use re.findall() (as given by the suggested answer). MUCH faster:

>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094


Finally, we use translate:

>>> from string import translate,maketrans,punctuation 
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934


Explanation:

string.translate is implemented in C and unlike many string manipulation functions in Python, string.translate does not produce a new string. So it's about as fast as you can get for string substitution.

It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces. A one-for-one substitute. Again, no new data is produced. So this is fast!

Next, we use good old split(). split() by default will operate on all whitespace characters, grouping them together for the split. The result will be the list of words that you want. And this approach is almost 4x faster than re.findall()!
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它30个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复