Split Strings into words with multiple word boundary delimiters

既然无缘  2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H         


        
30 Answers
  •  遇见更好的自我
    2020-11-21 06:08

    join = lambda x: sum(x, [])  # a.k.a. flatten1([[1], [2, 3], [4]]) -> [1, 2, 3, 4]
    # ...alternatively (sum copies the list repeatedly, so this scales better)...
    join = lambda lists: [x for l in lists for x in l]
    

    Then this becomes a three-liner:

    fragments = [text]
    for token in tokens:
        fragments = join(f.split(token) for f in fragments)
    
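    For instance, assuming tokens is simply a list of delimiter strings (the sample text and delimiter set below are only illustrative, not taken from the question), the loop behaves like this:

    join = lambda lists: [x for l in lists for x in l]
    text = "Hey, you - what are you doing here!?"   # illustrative input
    tokens = [",", "-", "!", "?", " "]               # illustrative delimiters

    fragments = [text]
    for token in tokens:
        fragments = join(f.split(token) for f in fragments)

    # str.split leaves empty strings between adjacent delimiters,
    # so filter them out to keep only the words
    print([f for f in fragments if f])
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']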

    Explanation

    This is what is known in Haskell as the List monad. The idea behind the monad is that once you are "in the monad" you "stay in the monad" until something takes you out. For example, suppose you map a function like Python's range(n) -> [0, 1, ..., n-1] over a list. Because each result is itself a list, the results are concatenated into one flat list, so you get something like map(range, [3, 4, 1]) -> [0, 1, 2, 0, 1, 2, 3, 0]. This map-then-flatten operation is what Haskell calls concatMap (it is the list monad's bind). The idea here is the same: you have an operation you keep applying (splitting on a token), and every time you apply it, you join the results back into a single list.
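    A minimal Python sketch of that map-then-flatten behaviour, reusing the join defined above (the input list [3, 4, 1] is just an example):

    join = lambda lists: [x for l in lists for x in l]

    # concatMap: map a list-producing function, then flatten one level
    print(join(map(range, [3, 4, 1])))   # [0, 1, 2, 0, 1, 2, 3, 0]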

    You can abstract this into a function and have tokens=string.punctuation by default.
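    A sketch of such a function, assuming the goal is a list of non-empty fragments and that every character of string.punctuation plus whitespace should act as a delimiter (the name split_on and the whitespace handling are my own choices, not part of the answer above):

    import string

    def split_on(text, tokens=string.punctuation + " \t\n"):
        """Split text on every delimiter in tokens, dropping empty fragments."""
        join = lambda lists: [x for l in lists for x in l]
        fragments = [text]
        for token in tokens:
            fragments = join(f.split(token) for f in fragments)
        return [f for f in fragments if f]

    print(split_on("foo, bar; baz...qux"))
    # ['foo', 'bar', 'baz', 'qux']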

    Advantages of this approach:

    • Unlike a naive regex character class, this approach handles delimiters of arbitrary length out of the box (regex can do this too, but only with a more elaborate pattern).
    • You are not restricted to literal tokens; each "token" could carry arbitrary logic, for example one of them could be a function that splits according to how deeply parentheses are nested (see the sketch below).
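
    To illustrate that last point, here is one possible elaboration (my own, not from the answer above): each element of tokens is either a delimiter string, handled with str.split, or a callable that maps a fragment to a list of fragments. The callable below splits only on commas at parenthesis nesting depth zero. The names split_with and split_top_level_commas are hypothetical.

    def split_with(text, tokens):
        """Each token is either a delimiter string or a callable
        fragment -> list of fragments."""
        join = lambda lists: [x for l in lists for x in l]
        fragments = [text]
        for token in tokens:
            split = token if callable(token) else (lambda f, t=token: f.split(t))
            fragments = join(split(f) for f in fragments)
        return [f for f in fragments if f]

    def split_top_level_commas(fragment):
        """Split on commas that are not inside parentheses."""
        parts, buf, depth = [], "", 0
        for ch in fragment:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            if ch == "," and depth == 0:
                parts.append(buf)
                buf = ""
            else:
                buf += ch
        parts.append(buf)
        return parts

    print(split_with("f(a,b) g(c),d", [" ", split_top_level_commas]))
    # ['f(a,b)', 'g(c)', 'd']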
