Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

\"H         


        
30 Answers
  •  -上瘾入骨i
    2020-11-21 06:04

    First, I agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case, the performance of this function wasn't significant, so I wanted to add ideas that I considered with that criterion in mind.

    My main goal was to generalize ideas from some of the other answers into one solution that works for strings containing more than just regex word characters (i.e., blacklisting an explicit subset of punctuation characters vs. whitelisting word characters).

    Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.
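
    As a rough sketch of that idea, string.punctuation can be turned into a regex character class. The name punct_class is only illustrative, and re.escape is used so that characters such as ] and \ are treated literally inside the class.

```python
import re
import string

my_str = "Hey, you - what are you doing here!?"

# Build a character class from every ASCII punctuation character.
# re.escape guards characters that are special inside a class, such as ] and \.
punct_class = "[" + re.escape(string.punctuation) + "]"

# Replace punctuation with spaces, then split on runs of whitespace.
words = re.sub(punct_class, " ", my_str).split()
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

    Keep in mind that string.punctuation also contains the apostrophe and the underscore, so a word like "don't" would be split in two with this approach.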

    Option 1 - re.sub

    I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.

    import re
    
    my_str = "Hey, you - what are you doing here!?"
    
    words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())
    

    In this solution, I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regex outside could be beneficial. For my use case the difference wasn't significant, so I prefer simplicity and readability.
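
    If the split runs many times, the precompiled variant could look something like the sketch below. The names PUNCT, SPACES and split_words are hypothetical and not part of the original answer.

```python
import re

# Compile both patterns once, at import time, instead of on every call.
PUNCT = re.compile(r"[,\-!?]")
SPACES = re.compile(r"\s+")


def split_words(text):
    """Replace the listed punctuation with spaces, then split on whitespace."""
    return SPACES.split(PUNCT.sub(" ", text).strip())


print(split_words("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

    The re module already caches recently used patterns, so the gain is likely to be modest unless the function is called in a tight loop.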

    Option 2 - str.replace

    This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.

    my_str = "Hey, you - what are you doing here!?"
    
    replacements = (',', '-', '!', '?')
    for r in replacements:
        my_str = my_str.replace(r, ' ')
    
    words = my_str.split()
    

    It would have been nice to map str.replace over the string instead, but I don't think that can be done with immutable strings. Mapping against a list of characters would work, but running every replacement against every character sounds excessive. (Edit: See the next option for a functional example.)
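
    For single-character delimiters, the str.translate(...) approach mentioned at the top handles all replacements in one pass over the string. A minimal sketch, with delimiters and table as illustrative names:

```python
my_str = "Hey, you - what are you doing here!?"

# Map each delimiter to a space; str.maketrans accepts two equal-length strings.
delimiters = ",-!?"
table = str.maketrans(delimiters, " " * len(delimiters))

# translate() applies the whole mapping in a single pass.
words = my_str.translate(table).split()
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

    This only works when every delimiter is exactly one character long; multi-character separators still need str.replace or a regex.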

    Option 3 - functools.reduce

    (In Python 2, reduce is available in global namespace without importing it from functools.)

    import functools
    
    my_str = "Hey, you - what are you doing here!?"
    
    replacements = (',', '-', '!', '?')
    my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
    words = my_str.split()
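
    The same reduce call can be wrapped in a small helper so the delimiter list becomes a parameter. This is only a sketch; split_on and its default argument are hypothetical names chosen for illustration.

```python
import functools


def split_on(text, delimiters=(",", "-", "!", "?")):
    """Replace each delimiter with a space, then split on whitespace."""
    cleaned = functools.reduce(
        lambda s, sep: s.replace(sep, " "), delimiters, text
    )
    return cleaned.split()


print(split_on("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# Because str.replace is used, multi-character delimiters work as well.
print(split_on("a::b--c", delimiters=("::", "--")))
# ['a', 'b', 'c']
```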
    
