Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"H


        
30 answers
  •  甜味超标
    2020-11-21 05:52

    If you want a reversible operation (preserve the delimiters), you can use this function:

    def tokenizeSentence_Reversible(sentence):
        """Split sentence on the delimiters below, keeping each delimiter as a token."""
        setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
        listOfTokens = [sentence]

        for delimiter in setOfDelimiters:
            newListOfTokens = []
            for token in listOfTokens:
                # Re-insert the delimiter before every piece except the first.
                ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
                pieces = [item for sublist in ll for item in sublist]  # flattens
                # Drop the empty tokens ('') left by adjacent, leading, or trailing delimiters.
                newListOfTokens.extend(p for p in pieces if p)

            listOfTokens = newListOfTokens

        return listOfTokens
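
As a point of comparison, the same reversible split can probably be sketched with the standard library's `re.split`, which keeps the delimiters in its result when the pattern contains a capturing group. The name `tokenize_reversible_re` and the sample sentence are made up for illustration and are not part of the answer above; the delimiter set is copied from it.

```python
import re

# Same single-character delimiters as in the function above.
DELIMITERS = ['.', ' ', ',', '*', ';', '!']

def tokenize_reversible_re(sentence):
    """Split on the delimiters, keeping each delimiter as its own token."""
    # Character class built from the escaped delimiters. The capturing
    # group makes re.split return the delimiters along with the words.
    pattern = '([' + ''.join(re.escape(d) for d in DELIMITERS) + '])'
    # Adjacent delimiters yield empty strings between them; drop those.
    return [tok for tok in re.split(pattern, sentence) if tok]

sentence = "Hey, you! What*are you doing; here."  # illustrative input
tokens = tokenize_reversible_re(sentence)

# Reversibility: joining the tokens restores the input exactly.
assert ''.join(tokens) == sentence

# Filtering out the delimiter tokens leaves just the words.
words = [tok for tok in tokens if tok not in DELIMITERS]
print(words)  # ['Hey', 'you', 'What', 'are', 'you', 'doing', 'here']
```

If only the words are needed and the delimiters can be discarded, the final filtering step is all that matters; the delimiter tokens are kept here only so that the round trip can be checked.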
    
