I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
Pro-Tip: Use string.translate for the fastest string operations Python has.
Some proof...
First, the slow way (sorry pprzemek):
>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
...
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552
Next, we use re.findall() (as given by the suggested answer). MUCH faster:
>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094
Finally, we use translate:
>>> from string import translate,maketrans,punctuation
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934
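Explanation:
string.translate is implemented in C. It still returns a new string (Python strings are immutable), but it builds that string in a single pass, looking each character up in a precomputed table instead of searching for separators or matching a pattern. So it's about as fast as you can get for character-for-character substitution.
It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces. It is a one-for-one substitution: the output has the same length as the input, and nothing is inserted or removed. So this is fast!
Note that the module-level string.translate and string.maketrans functions used above are Python 2; in Python 3 the equivalents are the str.translate() and str.maketrans() methods.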
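To make the translation table less magical, here is a minimal sketch of the same idea in Python 3, where the table is built with str.maketrans(). The names TABLE and translated are just illustrative; they are not from the timings above.

```python
from string import punctuation

# Python 3: map every punctuation character to a space.
# str.maketrans returns a dict of {code point: code point}.
TABLE = str.maketrans(punctuation, " " * len(punctuation))

S = "Hey, you - what are you doing here!?"

# translate() returns a new string of the same length,
# with each punctuation character replaced by a space.
translated = S.translate(TABLE)

print(TABLE[ord(",")] == ord(" "))   # the comma maps to a space
print(len(translated) == len(S))     # one-for-one substitution
print(repr(translated))
```

Because every unwanted character becomes exactly one space, the string keeps its length; the runs of spaces this leaves behind are harmless once the result is split on whitespace.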
Next, we use good old split(). By default, split() operates on all whitespace characters, grouping consecutive ones together for the split. The result will be the list of words that you want. And in the timings above this approach is a bit more than 3x faster than re.findall() (about 1.28 s versus 4.19 s)!
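For anyone on Python 3, the whole approach can be sketched as below. This is an adaptation, not the code that produced the timings above: split_words is a hypothetical helper name, and the numbers it prints will differ by machine and Python version, so measure for yourself before relying on the speedup.

```python
import re
import timeit
from string import punctuation

# Build the table once; it maps each punctuation character to a space.
TABLE = str.maketrans(punctuation, " " * len(punctuation))


def split_words(text):
    """Replace punctuation with spaces, then split on runs of whitespace."""
    return text.translate(TABLE).split()


S = "Hey, you - what are you doing here!?"
words = split_words(S)
print(words)

# Rough comparison against re.findall; results are machine-dependent.
t_translate = timeit.timeit(lambda: split_words(S), number=100_000)
t_findall = timeit.timeit(lambda: re.findall(r"\w+", S), number=100_000)
print("translate+split: %.3f s" % t_translate)
print("re.findall:      %.3f s" % t_findall)
```

One caveat: the two approaches are not identical on every input. For example, string.punctuation includes the underscore and the apostrophe, so split_words breaks "snake_case" and "don't" apart, while r"\w+" keeps the underscore but also splits on the apostrophe.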