Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H         


        
30 answers
  •  抹茶落季
    2020-11-21 06:07

    I had to come up with my own solution since everything I've tested so far failed at some point.

    >>> import re
    >>> def split_words(text):
    ...     rgx = re.compile(r"((?:(?

    It seems to be working fine, at least for the examples below.

    >>> split_words("The hill-tops gleam in morning's spring.")
    ['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
    >>> split_words("I'd say it's James' 'time'.")
    ["I'd", 'say', "it's", "James'", 'time']
    >>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
    ["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
    >>> split_words("google.com email@google.com split_words")
    ['google', 'com', 'email', 'google', 'com', 'split_words']
    >>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
    ['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
    >>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
    ['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']
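
For anyone who wants something runnable to experiment with, here is a minimal sketch. It is not the pattern from this answer; it is a simpler stand-in I am assuming for illustration, and the names `WORD_RE` and `split_words_sketch` are hypothetical. It treats a word as a run of word characters, optionally joined to further runs by a single hyphen or apostrophe. One known difference from the results above: a trailing apostrophe, as in "James'", is dropped rather than kept.

```python
import re

# Hypothetical stand-in pattern (not the answer's regex): one run of word
# characters, optionally followed by more runs that are each joined by a
# single hyphen or apostrophe.
WORD_RE = re.compile(r"\w+(?:[-']\w+)*")

def split_words_sketch(text):
    """Return the words in text, leaving out surrounding punctuation."""
    return WORD_RE.findall(text)

print(split_words_sketch("The hill-tops gleam in morning's spring."))
print(split_words_sketch("google.com email@google.com split_words"))
```

Because `re.findall` collects matches instead of splitting on delimiters, there is no need to list every punctuation character up front; anything that is not part of a match is simply skipped.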
    
