Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"H

30 Answers
  •  甜味超标
    2020-11-21 05:47

    First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.

    I come across this pretty frequently, and my usual solution doesn't require re.

    One-liner lambda function w/ list comprehension:

    (requires import string):

    split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
        text.split() if word.strip(string.punctuation) != '']
    
    # Call function
    split_without_punc("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
    


    Function (traditional)

    As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

    def split_without_punctuation2(text):
    
        # Split by whitespace
        words = text.split()
    
        # Strip punctuation from each word
        return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']
    
    split_without_punctuation2("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
    

    It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
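
    As a rough sketch of that hyphen variation: the helper name split_words_breaking_hyphens below is made up for illustration and is not part of the answer's code. It replaces hyphens with spaces and then applies the same whitespace split and punctuation strip as above.

```python
import string

def split_words_breaking_hyphens(text):
    # Hypothetical helper: turn hyphens into spaces so hyphenated words split apart
    text = text.replace("-", " ")

    # Same approach as above: split on whitespace, strip punctuation, drop empties
    return [word.strip(string.punctuation) for word in text.split()
            if word.strip(string.punctuation) != '']

print(split_words_breaking_hyphens("A well-known trick, isn't it?"))
```

    Because strip only removes characters from the ends of each word, the apostrophe inside "isn't" is left alone, while "well-known" comes back as two separate words.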

    General Function w/o Lambda or List Comprehension

    For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

    def split_without(text: str, ignore: str) -> list:
    
        # Split by whitespace
        split_string = text.split()
    
        # Strip any characters in the ignore string, and ignore empty strings
        words = []
        for word in split_string:
            word = word.strip(ignore)
            if word != '':
                words.append(word)
    
        return words
    
    # Situation-specific call to general function
    import string
    final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
    

    Of course, you can always generalize the lambda function to any specified string of characters as well.
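
    One possible way to generalize the lambda is sketched below. The name split_without_chars and the choice to default ignore to string.punctuation are assumptions made for this example, not something the original one-liner defines.

```python
import string

# Hypothetical generalized one-liner: 'ignore' is any string of characters
# to strip from both ends of each word (defaults to ASCII punctuation)
split_without_chars = lambda text, ignore=string.punctuation: [
    word.strip(ignore) for word in text.split() if word.strip(ignore) != ''
]

# Default behaviour matches the punctuation-only version
print(split_without_chars("Hey, you -- what are you doing?!"))

# Custom set of characters to strip
print(split_without_chars("**bold** and __under__", "*_"))
```

    With a custom ignore string, only those characters are stripped, so any other punctuation stays attached to the words.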
