Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"H...

30 answers
  •  不思量自难忘°
    2020-11-21 06:06

    Pro-Tip: Use string.translate for the fastest string operations Python has.

    Some proof...

    First, the slow way (sorry pprzemek):

    >>> import timeit
    >>> S = 'Hey, you - what are you doing here!?'
    >>> def my_split(s, seps):
    ...     res = [s]
    ...     for sep in seps:
    ...         s, res = res, []
    ...         for seq in s:
    ...             res += seq.split(sep)
    ...     return res
    ... 
    >>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
    54.65477919578552
    

    Next, we use re.findall() (as given by the suggested answer). MUCH faster:

    >>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
    4.194725036621094
    

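    Finally, we use translate (the string module functions imported here exist in Python 2; current Python 3 provides str.maketrans and str.translate instead):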

    >>> from string import translate,maketrans,punctuation 
    >>> T = maketrans(punctuation, ' '*len(punctuation))
    >>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
    1.2835021018981934
    

    Explanation:

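    string.translate is implemented in C and does the whole substitution in a single pass over the string using a lookup table. Like every Python string operation it returns a new string (strings are immutable), but it creates no intermediate strings or lists along the way. So it's about as fast as you can get for character-for-character substitution.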

    It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces. A one-for-one substitution, so the length of the string never changes and each character costs only a table lookup. So this is fast!
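
For readers on Python 3, here is a minimal sketch of the same table-building step. It assumes str.maketrans and str.translate as stand-ins for the Python 2 string module functions used above; the variable names are only illustrative.

```python
import string

S = 'Hey, you - what are you doing here!?'

# Map every ASCII punctuation character to a space, one-for-one.
# In Python 3 the table is a dict from code point to code point.
T = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

cleaned = S.translate(T)   # returns a new string; S itself is unchanged
words = cleaned.split()    # split on runs of whitespace
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

The table only needs to be built once and can be reused for every string you translate.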

    Next, we use good old split(). By default, split() treats any run of whitespace characters as a single separator. The result will be the list of words that you want. And in the timings above this approach is more than 3x faster than re.findall() (about 1.28 s versus 4.19 s)!
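
To re-run the comparison yourself on Python 3, something like the following sketch may help. The function names split_translate and split_findall are hypothetical, chosen here for illustration, and the timings will vary by machine and Python version, so no numbers are claimed.

```python
import re
import string
import timeit

S = 'Hey, you - what are you doing here!?'

# Build the punctuation-to-space table once, up front.
_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def split_translate(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(_TABLE).split()

def split_findall(text):
    """Collect runs of word characters with a regular expression."""
    return re.findall(r'\w+', text)

# Both approaches agree on the sample sentence.
assert split_translate(S) == split_findall(S)

# They differ on underscores: '_' is in string.punctuation but also matches \w.
print(split_translate('snake_case'))  # ['snake', 'case']
print(split_findall('snake_case'))    # ['snake_case']

if __name__ == '__main__':
    # Time each approach; absolute numbers depend on the environment.
    for fn in (split_translate, split_findall):
        seconds = timeit.timeit(lambda: fn(S), number=100_000)
        print(f'{fn.__name__}: {seconds:.3f}s')
```

Note that the two approaches are not interchangeable on every input: the translate version only splits on the ASCII characters in string.punctuation, while \w+ follows the regex definition of a word character.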
