I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
\"H
So many answers, yet I can't find any solution that does efficiently what the title of the question literally asks for (splitting on multiple possible separators); instead, many answers split on anything that is not a word, which is different. So here is an answer to the question in the title, one that relies on Python's standard and efficient re module:
>>> import re  # will be splitting on any of: space , - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
where:

- […] matches one of the separators listed inside it,
- \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
- + skips one or more consecutive delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched separators), and
- filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value); both effects are shown in the small example below.

This re.split() precisely "splits with multiple separators", as asked for in the question title.
This solution is furthermore immune to the problems with non-ASCII characters in words that affect some of the other solutions (see the first comment to ghostdog74's answer).
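To illustrate, here is a small sketch with a hypothetical accented input; the second pattern, which splits on anything that is not an ASCII letter, stands in for the kind of approach that breaks such words:

>>> import re
>>> s = "Hey, señor - what are you doing here!?"
>>> list(filter(None, re.split(r"[, \-!?:]+", s)))   # explicit separators: "señor" survives
['Hey', 'señor', 'what', 'are', 'you', 'doing', 'here']
>>> list(filter(None, re.split(r"[^a-zA-Z]+", s)))   # "non-ASCII-letter" split: the word is cut
['Hey', 'se', 'or', 'what', 'are', 'you', 'doing', 'here']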
The re module is also much more efficient, both in speed and in concision, than doing the Python loops and tests "by hand"!
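If you want to check the speed part of that claim on your own data, a quick timeit comparison is easy to set up. The following is only a sketch: split_by_hand and split_with_re are hypothetical helpers written for this comparison (the manual version replaces each separator by a space and then splits on whitespace), and the printed timings are machine- and input-dependent.

import re
from timeit import timeit

s = "Hey, you - what are you doing here!?"

def split_by_hand(text, seps=",-!?: "):
    # hypothetical "by hand" version: replace each separator by a space,
    # then split on whitespace
    for sep in seps:
        text = text.replace(sep, " ")
    return text.split()

pattern = re.compile(r"[, \-!?:]+")

def split_with_re(text):
    # the approach from this answer, with the pattern precompiled
    return list(filter(None, pattern.split(text)))

assert split_by_hand(s) == split_with_re(s)  # same result on the example

print(timeit(lambda: split_by_hand(s), number=100_000))  # seconds for 100k calls
print(timeit(lambda: split_with_re(s), number=100_000))

The sketch just prints both numbers rather than asserting a winner, since the relative cost depends on the length of the text and the number of separators.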