Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should give something like:

['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

30 Answers
  • 2020-11-21 06:00

    Use replace twice:

    a = '11223FROM33344INTO33222FROM3344'
    a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')
    

    results in:

    ['11223', '33344', '33222', '3344']
    
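    The same chaining idea applied to the question's string (a quick sketch; the delimiter set is taken from the example above):

    s = "Hey, you - what are you doing here!?"
    s.replace(',', ' ').replace('-', ' ').replace('!', ' ').replace('?', ' ').split()
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
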
  • 2020-11-21 06:03

    Here is the answer with some explanation.

    st = "Hey, you - what are you doing here!?"
    
    # replace every non-alphanumeric character with a space, then join
    new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
    # output of new_string
    'Hey  you   what are you doing here  '
    
    # str.split() will remove all the empty strings if a separator is not provided
    new_list = new_string.split()
    
    # output of new_list
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
    
    # we can join it back into a string without any non-alphanumeric characters
    ' '.join(new_list)
    # output
    'Hey you what are you doing here'
    

    Or, in one line, we can do it like this:

    (''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()
    
    # output
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
    


  • 2020-11-21 06:03

    I like the replace() way best. The following procedure changes every separator defined in the string splitlist into the first separator in splitlist, and then splits the text on that single separator. It also handles the case where splitlist happens to be an empty string, and it returns a list of words with no empty strings in it.

    def split_string(text, splitlist):
        # normalize every separator to the first one, then split on it
        for sep in splitlist:
            text = text.replace(sep, splitlist[0])
        # filter(None, ...) drops empty strings; list() makes it a list on Python 3
        return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]
    
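    For example, calling it on the question's string with a separator list of my choosing:

    split_string("Hey, you - what are you doing here!?", " ,-!?")
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
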
  • 2020-11-21 06:04

    First, I want to agree with others that the regex- or str.translate(...)-based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add the ideas I considered with that criterion in mind.

    My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex word characters (i.e., blacklisting an explicit subset of punctuation characters versus whitelisting word characters).

    Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.

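    As a sketch of that idea (the pattern variable here is my own, not part of the options below), such a blacklist could be built from string.punctuation with re.escape:

    import re
    import string

    # character class matching every ASCII punctuation character
    pattern = '[' + re.escape(string.punctuation) + ']'
    re.sub(pattern, ' ', "Hey, you - what are you doing here!?").split()
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
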
    Option 1 - re.sub

    I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.

    import re
    
    my_str = "Hey, you - what are you doing here!?"
    
    words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())
    

    In this solution, I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regex outside the call could be beneficial; for my use case the difference wasn't significant, so I prefer simplicity and readability.

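    A precompiled variant of the same idea could look like this (a sketch; the names are mine):

    import re

    # compile once, reuse across many strings
    DELIMS = re.compile(r'[,\-!?]')
    WHITESPACE = re.compile(r'\s+')

    def split_words(text):
        return WHITESPACE.split(DELIMS.sub(' ', text).strip())

    split_words("Hey, you - what are you doing here!?")
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
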
    Option 2 - str.replace

    This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.

    my_str = "Hey, you - what are you doing here!?"
    
    replacements = (',', '-', '!', '?')
    for r in replacements:
        my_str = my_str.replace(r, ' ')
    
    words = my_str.split()
    

    It would have been nice to be able to map str.replace over the string instead, but I don't think that can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: see the next option for a functional example.)

    Option 3 - functools.reduce

    (In Python 2, reduce is available in the global namespace without importing it from functools.)

    import functools
    
    my_str = "Hey, you - what are you doing here!?"
    
    replacements = (',', '-', '!', '?')
    my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
    words = my_str.split()
    
  • 2020-11-21 06:06

    Pro-Tip: Use string.translate for the fastest string operations Python has.

    Some proof...

    First, the slow way (sorry pprzemek):

    >>> import timeit
    >>> S = 'Hey, you - what are you doing here!?'
    >>> def my_split(s, seps):
    ...     res = [s]
    ...     for sep in seps:
    ...         s, res = res, []
    ...         for seq in s:
    ...             res += seq.split(sep)
    ...     return res
    ... 
    >>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
    54.65477919578552
    

    Next, we use re.findall() (as given by the suggested answer). MUCH faster:

    >>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
    4.194725036621094
    

    Finally, we use translate:

    >>> from string import translate,maketrans,punctuation 
    >>> T = maketrans(punctuation, ' '*len(punctuation))
    >>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
    1.2835021018981934
    

    Explanation:

    string.translate is implemented in C: it maps each character through a 256-entry translation table in a single pass, with no per-character work at the Python level, so it's about as fast as you can get for string substitution.

    It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make one with the maketrans() convenience function. The objective here is to translate every unwanted character to a space: a simple one-for-one substitution, which is why it is so fast.

    Next, we use good old split(). By default, split() operates on runs of whitespace, grouping them together for the split, so the result is exactly the list of words you want. And in the timings above, this approach is over 3x faster than re.findall()!

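    For reference, on Python 3 the same approach would go through str.maketrans and str.translate (a sketch only; the Python 2 timings above do not apply to it):

    from string import punctuation

    S = 'Hey, you - what are you doing here!?'
    T = str.maketrans(punctuation, ' ' * len(punctuation))
    S.translate(T).split()
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
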
  • 2020-11-21 06:07

    I like re, but here is my solution without it:

    from itertools import groupby
    sep = ' ,-!?'
    s = "Hey, you - what are you doing here!?"
    print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])
    

    sep.__contains__ is the method used by the 'in' operator. Basically, it is the same as

    lambda ch: ch in sep
    

    but is more convenient here.

    groupby takes our string and that function. It splits the string into groups using the function: whenever the function's value changes, a new group is started. So sep.__contains__ is exactly what we need.

    groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is the group. Using 'if not k' we filter out the groups made of separators (because sep.__contains__ returns True on separators). Well, that's all: now we have a sequence of groups, each of which is a word (a group is actually an iterable, so we use join to convert it to a string).

    This solution is quite general, because it uses a function to separate the string (you can split on any condition you need). Also, it doesn't create intermediate strings or lists (you can even remove the join and the expression becomes lazy, since each group is an iterator).

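    A reusable generator version of the same idea might look like this (a sketch; split_on is my own name, and the default separator set is taken from the example above):

    from itertools import groupby

    def split_on(s, seps=' ,-!?'):
        # yields one word at a time; groups are consumed in order, lazily
        return (''.join(g) for k, g in groupby(s, seps.__contains__) if not k)

    list(split_on("Hey, you - what are you doing here!?"))
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
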