Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H         


        
30 Answers
  • 2020-11-21 06:07

    I had to come up with my own solution since everything I've tested so far failed at some point.

    >>> import re
    >>> def split_words(text):
    ...     rgx = re.compile(r"((?:(?<!'|\w)(?:\w-?'?)+(?<!-))|(?:(?<='|\w)(?:\w-?'?)+(?=')))")
    ...     return rgx.findall(text)
    

    It seems to be working fine, at least for the examples below.

    >>> split_words("The hill-tops gleam in morning's spring.")
    ['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
    >>> split_words("I'd say it's James' 'time'.")
    ["I'd", 'say', "it's", "James'", 'time']
    >>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
    ["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
    >>> split_words("google.com email@google.com split_words")
    ['google', 'com', 'email', 'google', 'com', 'split_words']
    >>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
    ['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
    >>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
    ['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']
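
    For readers who find the one-line pattern hard to follow, here is the same regex laid out with re.VERBOSE. The comments are my reading of the pattern, not the original author's explanation, and the name WORD_RE is only illustrative.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# The comments describe one reading of each part; they are not from the author.
WORD_RE = re.compile(
    r"""
    (
        (?:                 # case 1: word at an ordinary boundary
            (?<!'|\w)       #   not preceded by an apostrophe or a word character
            (?:\w-?'?)+     #   word chars, each optionally followed by a hyphen and/or apostrophe
            (?<!-)          #   the match must not end on a hyphen
        )
        |
        (?:                 # case 2: typically a word wrapped in single quotes
            (?<='|\w)       #   preceded by an apostrophe or a word character
            (?:\w-?'?)+     #   same body as case 1
            (?=')           #   must be followed by an apostrophe
        )
    )
    """,
    re.VERBOSE,
)


def split_words(text):
    # One capturing group, so findall returns a flat list of strings.
    return WORD_RE.findall(text)


print(split_words("The hill-tops gleam in morning's spring."))
```

    Whitespace and comments inside the pattern are ignored in verbose mode, so this should behave like the compact version.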
    
  • 2020-11-21 06:08

    Another way, without regex

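    import string

    punc = string.punctuation
    thestring = "Hey, you - what are you doing here!?"
    # Drop every punctuation character, then split on whitespace.
    words = ''.join(c for c in thestring if c not in punc).split()
    print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']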
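
    One caveat: deleting punctuation glues together words that are separated only by punctuation, so "hill-tops" would come out as "hilltops". A possible variant, sketched below, maps each punctuation character to a space instead. The function name split_on_punctuation is made up for this example, and note that it also splits on apostrophes.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def split_on_punctuation(text):
    # Replace punctuation with spaces, then split on runs of whitespace.
    return text.translate(_PUNCT_TO_SPACE).split()


print(split_on_punctuation("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
print(split_on_punctuation("The hill-tops gleam"))
# ['The', 'hill', 'tops', 'gleam']
```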
    
  • 2020-11-21 06:08
    join = lambda x: sum(x,[])  # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
    # ...alternatively...
    join = lambda lists: [x for l in lists for x in l]
    

    Then this becomes a three-liner:

    fragments = [text]
    for token in tokens:
        fragments = join(f.split(token) for f in fragments)
    

    Explanation

    This is what Haskell calls the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example, say you map a function like Python's range(n) -> [0,1,...,n-1] over a list. In the List monad each resulting list is concatenated into the overall result rather than nested, so mapping range over [3,4,1] gives [0,1,2,0,1,2,3,0]. In Haskell this operation is concatMap (the list monad's bind, >>=); mappend is the related operation that concatenates two lists. The idea here is that you've got an operation you're applying (splitting on a token), and whenever you apply it, you join the results back into one flat list.
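
    To make the "map, then flatten" step concrete, here is a small Python sketch. The helper name concat_map is borrowed from Haskell's concatMap and is not part of the Python standard library.

```python
from itertools import chain


def concat_map(func, items):
    # Apply func to every item, then flatten the resulting iterables one level.
    return list(chain.from_iterable(func(item) for item in items))


print(concat_map(range, [3, 4, 1]))
# [0, 1, 2, 0, 1, 2, 3, 0]
print(concat_map(lambda s: s.split(","), ["a,b", "c"]))
# ['a', 'b', 'c']
```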

    You can abstract this into a function and have tokens=string.punctuation by default.
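
    A minimal sketch of that abstraction might look like this. The function name split_on_tokens is my own, and the sketch keeps empty fragments and does not split on whitespace, since whitespace is not in string.punctuation.

```python
import string


def split_on_tokens(text, tokens=string.punctuation):
    # Start with one fragment and split every fragment on each token in turn.
    fragments = [text]
    for token in tokens:
        fragments = [piece for f in fragments for piece in f.split(token)]
    return fragments


print(split_on_tokens("a,b;c"))           # ['a', 'b', 'c']
print(split_on_tokens("a--b", ["--"]))    # ['a', 'b']
print(split_on_tokens("a,,b"))            # ['a', '', 'b']
```

    Filtering with `[f for f in fragments if f]` would drop the empty pieces if they are unwanted.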

    Advantages of this approach:

    • This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
    • You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the "tokens" could be a function which splits according to how nested parentheses are.
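
    The second point could be sketched as follows, assuming each "splitter" is either a plain separator string or a callable that takes a fragment and returns a list of pieces. The name split_with and this calling convention are assumptions made for the example.

```python
def split_with(text, splitters):
    fragments = [text]
    for splitter in splitters:
        if callable(splitter):
            # Arbitrary logic: any function mapping a string to a list of strings.
            split_one = splitter
        else:
            # Plain separator string; bind it now to avoid late-binding surprises.
            split_one = lambda fragment, sep=splitter: fragment.split(sep)
        fragments = [piece for f in fragments for piece in split_one(f)]
    return fragments


# str.split with no separator splits on runs of whitespace and drops empties.
print(split_with("a, b;c  d", [",", ";", str.split]))
# ['a', 'b', 'c', 'd']
```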
  • 2020-11-21 06:10

    I like pprzemek's solution because it does not assume that the delimiters are single characters and it doesn't try to leverage a regex (which would not work well if the number of separators got to be crazy long).

    Here's a more readable version of the above solution for clarity:

    def split_string_on_multiple_separators(input_string, separators):
        buffer = [input_string]
        for sep in separators:
            strings = buffer
            buffer = []  # reset the buffer
            for s in strings:
                buffer = buffer + s.split(sep)
    
        return buffer
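
    For comparison only, multi-character separators can also be handled with a regex by escaping each separator and joining them with "|". This is a sketch, not the approach this answer recommends; split_with_regex is a made-up name and it assumes a non-empty list of non-empty separators.

```python
import re


def split_with_regex(input_string, separators):
    # Try longer separators first, so "--" is matched before "-".
    ordered = sorted(separators, key=len, reverse=True)
    # re.escape keeps characters such as "." or "|" from acting as regex syntax.
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, input_string)


print(split_with_regex("a, b;c--d-e", ["-", "--", ", ", ";"]))
# ['a', 'b', 'c', 'd', 'e']
```

    Unlike the loop above, this makes a single pass over the string, but the pattern grows with the number of separators.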
    
  • 2020-11-21 06:12

    re.split()

    re.split(pattern, string, maxsplit=0, flags=0)

    Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

    >>> re.split(r'\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    >>> re.split(r'(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    >>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
    ['Words', 'words, words.']
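
    Note the trailing empty string in the first result: the text ends with a delimiter, so re.split reports an empty piece after it. Two common ways around that, plus a split on an explicit set of delimiters, are sketched here.

```python
import re

text = "Words, words, words."

# Match the words themselves instead of splitting on what lies between them.
print(re.findall(r"\w+", text))
# ['Words', 'words', 'words']

# Or keep re.split and filter out the empty pieces afterwards.
print([w for w in re.split(r"\W+", text) if w])
# ['Words', 'words', 'words']

# A character class restricts the split to specific delimiters.
print(re.split(r"[,;]\s*", "a, b;c,d"))
# ['a', 'b', 'c', 'd']
```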
    
  • 2020-11-21 06:12
    def get_words(s):
        l = []
        w = ''
        for c in s.lower():
            if c in '-!?,. ':
                if w != '': 
                    l.append(w)
                w = ''
            else:
                w = w + c
        if w != '': 
            l.append(w)
        return l
    

    Here is the usage:

    >>> s = "Hey, you - what are you doing here!?"
    >>> print(get_words(s))
    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
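
    The same character-by-character idea can be written more compactly with itertools.groupby, which groups consecutive characters by whether they are delimiters. This is a sketch; the name get_words_grouped and the delimiters parameter are additions for illustration.

```python
from itertools import groupby


def get_words_grouped(s, delimiters="-!?,. "):
    # Group consecutive characters by "is this a delimiter?" and keep
    # only the runs of non-delimiter characters.
    return [
        "".join(group)
        for is_delim, group in groupby(s.lower(), key=lambda c: c in delimiters)
        if not is_delim
    ]


print(get_words_grouped("Hey, you - what are you doing here!?"))
# ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```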
    