Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H         


        
30 Answers
  • 2020-11-21 05:52

    So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for (splitting on multiple possible separators; many answers instead split on anything that is not a word, which is different). So here is an answer to the question in the title that relies on Python's standard and efficient re module:

    >>> import re  # Will be splitting on: , <space> - ! ? :
    >>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
    

    where:

    • the […] matches one of the separators listed inside,
    • the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
    • the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched separators), and
    • filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).

    This re.split() precisely "splits with multiple separators", as asked for in the question title.

    This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74's answer).

    The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!
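When the separators come from a variable rather than being typed into the pattern, the character class can be built programmatically. The following is a minimal sketch, not part of the original answer: the helper name `split_on_separators` is hypothetical, and it assumes every separator is a single character.

```python
import re

def split_on_separators(text, separators):
    """Split text on any run of the given single-character separators."""
    # re.escape protects characters such as '-' or ']' inside the class.
    char_class = "[" + "".join(re.escape(sep) for sep in separators) + "]+"
    # Leading or trailing separators yield empty strings; drop them.
    return [word for word in re.split(char_class, text) if word]

print(split_on_separators("Hey, you - what are you doing here!?", ", -!?:"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Because the pattern only names the separators, words containing non-ASCII letters pass through unchanged.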

  • 2020-11-21 05:52

    If you want a reversible operation (preserve the delimiters), you can use this function:

    def tokenizeSentence_Reversible(sentence):
        setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
        listOfTokens = [sentence]
    
        for delimiter in setOfDelimiters:
            newListOfTokens = []
            for token in listOfTokens:
                # Put the delimiter back in front of every piece except the first.
                for ind, w in enumerate(token.split(delimiter)):
                    pieces = [delimiter, w] if ind > 0 else [w]
                    # Drop the empty strings produced by adjacent delimiters.
                    newListOfTokens.extend(p for p in pieces if p)
    
            listOfTokens = newListOfTokens
    
        return listOfTokens
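
A shorter way to get a reversible split, if using the re module is acceptable, is to wrap the separators in a capturing group, since re.split() then returns the delimiters too. This is a sketch of an alternative technique, not the function above; the name `tokenize_reversible` and its default delimiter string are assumptions chosen to mirror that function.

```python
import re

def tokenize_reversible(sentence, delimiters=".,*;! "):
    """Split on single-character delimiters while keeping them as tokens."""
    # A capturing group makes re.split return the delimiters as well.
    pattern = "([" + re.escape(delimiters) + "])"
    # Adjacent delimiters produce empty strings between them; drop those.
    return [tok for tok in re.split(pattern, sentence) if tok]

tokens = tokenize_reversible("Hi, there! Bye.")
print(tokens)
# ['Hi', ',', ' ', 'there', '!', ' ', 'Bye', '.']

# Joining the tokens gives back the original sentence.
assert "".join(tokens) == "Hi, there! Bye."
```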
    
  • 2020-11-21 05:55

    I had a similar dilemma and didn't want to use 're' module.

    def my_split(s, seps):
        res = [s]
        for sep in seps:
            s, res = res, []
            for seq in s:
                res += seq.split(sep)
        return res
    
    print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
    ['1111', '', '2222', '3333', '4444', '5555', '6666']
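
Note the empty string in the output above: it comes from the two consecutive spaces after 1111. If empty pieces are not wanted, a variant along these lines could drop them. This is a sketch; `my_split_nonempty` is a hypothetical name, not part of the original answer.

```python
def my_split_nonempty(s, seps):
    """Split s on each separator in turn and discard empty pieces."""
    pieces = [s]
    for sep in seps:
        # Split every current piece on this separator and flatten the result.
        pieces = [part for piece in pieces for part in piece.split(sep)]
    return [piece for piece in pieces if piece]

print(my_split_nonempty('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
# ['1111', '2222', '3333', '4444', '5555', '6666']
```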
    
  • 2020-11-21 05:55

    Try this:

    import re
    
    phrase = "Hey, you - what are you doing here!?"
    matches = re.findall(r'\w+', phrase)
    print(matches)
    

    This will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
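
One thing to keep in mind with this approach: \w matches letters, digits and the underscore, so an apostrophe ends a word. The sketch below shows the effect on a made-up phrase and one possible adjustment, adding the apostrophe to the character class. Whether that adjustment is appropriate depends on the text being processed.

```python
import re

phrase = "Don't stop, snake_case!"  # made-up example input

# \w+ breaks the contraction at the apostrophe but keeps the underscore.
plain = re.findall(r"\w+", phrase)
print(plain)  # ['Don', 't', 'stop', 'snake_case']

# Allowing apostrophes inside words keeps the contraction together.
with_apostrophes = re.findall(r"[\w']+", phrase)
print(with_apostrophes)  # ["Don't", 'stop', 'snake_case']
```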

  • 2020-11-21 05:56

    In Python 3, you can use the method from PY4E - Python for Everybody.

    We can solve both these problems by using the string methods lower, punctuation, and translate. The translate is the most subtle of the methods. Here is the documentation for translate:

    your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))

    Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.
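
To illustrate all three arguments at once, here is a small sketch (the choice of characters is an assumption made for the example): the hyphen is replaced by a space, while the comma, exclamation mark and question mark are deleted.

```python
text = "Hey, you - what are you doing here!?"

# fromstr "-" maps to tostr " "; every character in deletestr ",!?" is removed.
table = str.maketrans("-", " ", ",!?")

words = text.translate(table).split()
print(words)
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Calling split() with no argument collapses the run of spaces left where the hyphen was.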

    You can see the "punctuation":

    In [10]: import string
    
    In [11]: string.punctuation
    Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'  
    

    For your example:

    In [12]: your_str = "Hey, you - what are you doing here!?"
    
    In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))
    
    In [14]: line = line.lower()
    
    In [15]: words = line.split()
    
    In [16]: print(words)
    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
    

    For more information, you can refer to:

    • PY4E - Python for Everybody
    • str.translate
    • str.maketrans
    • Python String maketrans() Method
  • 2020-11-21 05:56

    I'm re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

    tokens = [x.strip() for x in data.split(',')]
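
For reference, this is how that one-liner behaves on a sample string. The value of `data` is a made-up input, since the answer does not show one. It splits on commas only, so other separators stay inside the tokens.

```python
data = "apple, banana ,cherry pie"  # hypothetical input

# Split on commas, then trim the whitespace around each piece.
tokens = [x.strip() for x in data.split(',')]
print(tokens)
# ['apple', 'banana', 'cherry pie']
```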
    