I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.
Furthermore, I need the separators to be kept in the output.
Try this:
import re
import string
re.split('([' + re.escape(string.punctuation + string.whitespace) + ']+)', "Now is the winter of our discontent")
Explanation from the Python documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
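To make the quoted behaviour concrete, here is a small sketch (Python 3 syntax; the names `separators`, `without_group` and `with_group` are only illustrative) that compares the same pattern with and without the capturing parentheses:

```python
import re
import string

# One or more punctuation/whitespace characters in a row.
separators = "[" + re.escape(string.punctuation + string.whitespace) + "]+"

text = "a, b...c d"

# No capturing group: the separators are thrown away.
without_group = re.split(separators, text)

# Capturing group: each run of separators is returned between the words.
with_group = re.split("(" + separators + ")", text)

print(without_group)  # ['a', 'b', 'c', 'd']
print(with_group)     # ['a', ', ', 'b', '...', 'c', ' ', 'd']
```

Note that if the string ends (or starts) with a separator, re.split also returns an empty string at that end, e.g. 'Hi!' gives ['Hi', '!', ''], so you may want to filter those out.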
from itertools import chain, cycle
s = "Now is the winter of our discontent"
words = s.split()
# The built-in zip stops at the shorter input (on Python 2, itertools.izip is the lazy equivalent)
wordsWithWhitespace = list( chain.from_iterable( zip( words, cycle([" "]) ) ) )
# result : ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']
import re
import string
p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
string.punctuation + string.whitespace)))
print(p.findall("Now is the winter of our discontent"))
I'm no big fan of using regexps for all problems, but I don't think you have much choice in this if you want it fast and short.
I'll explain the regexp since you're not familiar with it:
[...] means any of the characters inside the square brackets
[^...] means any of the characters not inside the square brackets
+ behind means one or more of the previous thing
x|y means to match either x or y
So the regexp matches one or more characters, where either all of them are punctuation or whitespace, or none of them are. The findall method finds all non-overlapping matches of the pattern.
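As a quick check of that explanation, here is the same pattern run on a string that actually contains punctuation (a sketch in Python 3 syntax; `specials`, `token` and `pieces` are just illustrative names):

```python
import re
import string

specials = re.escape(string.punctuation + string.whitespace)

# Either a run of "ordinary" characters or a run of separator characters.
token = re.compile("[^{0}]+|[{0}]+".format(specials))

pieces = token.findall("a, b...c d")
print(pieces)  # ['a', ', ', 'b', '...', 'c', ' ', 'd']

# Every character lands in exactly one match, so joining restores the input.
print("".join(pieces) == "a, b...c d")  # True
```

Since findall only returns what matched, it does not produce empty strings when the input starts or ends with a separator.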
A different non-regex approach from the others:
>>> import string
>>> from itertools import groupby
>>>
>>> special = set(string.punctuation + string.whitespace)
>>> s = "One two three tab\ttabandspace\t end"
>>>
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
>>> split_combined
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
>>> split_separated
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
Could use dict.fromkeys and .get instead of the lambda, I guess.
[edit]
Some explanation:
groupby accepts two arguments, an iterable and an (optional) key function. It loops through the iterable and groups the items by the value of the key function:
>>> groupby("sentence", lambda c: c in 'nt')
<itertools.groupby object at 0x9805af4>
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
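To illustrate that pitfall, here is a small sketch (the word list and the `first_letter` helper are made up for the example) showing groupby on unsorted versus sorted input:

```python
from itertools import groupby

def first_letter(word):
    # Key function: group words by their first character.
    return word[0]

words = ["apple", "bob", "avocado", "banana"]

# Unsorted input: a new group starts every time the key changes.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=first_letter)]
print(unsorted_groups)
# [('a', ['apple']), ('b', ['bob']), ('a', ['avocado']), ('b', ['banana'])]

# Sorting by the same key first puts equal keys next to each other.
sorted_groups = [(k, list(g))
                 for k, g in groupby(sorted(words, key=first_letter), key=first_letter)]
print(sorted_groups)
# [('a', ['apple', 'avocado']), ('b', ['bob', 'banana'])]
```

For the splitting problem here we rely on exactly this "contiguous only" behaviour, so no sorting is wanted.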
As @JonClements guessed, what I had in mind was
>>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
>>> s = "One two three tab\ttabandspace\t end"
>>> [''.join(g) for k,g in groupby(s, special.get)]
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
for the case where we were combining the separators. .get returns None if the key isn't in the dict.
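If it helps to see what the key function returns, this sketch (the variable `keyed` is only for illustration) keeps the keys next to the joined groups:

```python
import string
from itertools import groupby

# Every separator character maps to True; anything else is missing from the dict.
special = dict.fromkeys(string.punctuation + string.whitespace, True)

# special.get(c) is True for separators and None for everything else,
# so groupby alternates between the two kinds of runs.
keyed = [(k, "".join(g)) for k, g in groupby("a, b", special.get)]
print(keyed)  # [(None, 'a'), (True, ', '), (None, 'b')]
```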
Solution in linear (O(n)) time:
Let's say you have a string:
original = "a, b...c d"
First convert all separators to space:
import string
splitters = string.punctuation + string.whitespace
trans = str.maketrans(splitters, ' ' * len(splitters))  # string.maketrans on Python 2
s = original.translate(trans)
Now s == 'a  b   c d' (each separator character became exactly one space, so s has the same length as original). Now you can use itertools.groupby to alternate between spaces and non-spaces:
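import itertools
result = []
position = 0
for _, letters in itertools.groupby(s, lambda c: c == ' '):
    # Length of the current run of spaces or non-spaces
    letter_count = len(list(letters))
    # Slice the original string so the real separators are kept
    result.append(original[position:position + letter_count])
    position += letter_count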
Now result == ['a', ', ', 'b', '...', 'c', ' ', 'd'], which is what you need.
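Putting the steps above together, a minimal sketch as one function might look like this (Python 3, so it assumes str.maketrans; the function name `split_keep_separators` is just something I made up):

```python
import itertools
import string

def split_keep_separators(original):
    """Split on punctuation/whitespace runs, keeping the separators."""
    splitters = string.punctuation + string.whitespace
    # Map every separator character to a single space (length is preserved).
    trans = str.maketrans(splitters, " " * len(splitters))
    s = original.translate(trans)

    result = []
    position = 0
    # groupby alternates between runs of spaces and runs of non-spaces.
    for _, run in itertools.groupby(s, lambda c: c == " "):
        run_length = sum(1 for _ in run)
        # Slice the original string so the real separators come back.
        result.append(original[position:position + run_length])
        position += run_length
    return result

print(split_keep_separators("a, b...c d"))
# ['a', ', ', 'b', '...', 'c', ' ', 'd']
```

Each character is visited a constant number of times (translate, groupby, slicing), which is where the linear running time comes from.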
My take:
from string import whitespace, punctuation
import re
pattern = re.escape(whitespace + punctuation)
print(re.split('([' + pattern + '])', 'now is the winter of'))
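One thing to be aware of with this version: without a + after the character class, every separator character is split on individually, so two separators in a row leave an empty string between them. A small sketch of the difference (Python 3 syntax; variable names are illustrative):

```python
import re
from string import whitespace, punctuation

pattern = re.escape(whitespace + punctuation)

# One separator character per split: ', ' yields ',' and ' ' with '' in between.
one_at_a_time = re.split('([' + pattern + '])', 'a, b')
print(one_at_a_time)  # ['a', ',', '', ' ', 'b']

# Adding + treats a run of separators as a single piece.
runs = re.split('([' + pattern + ']+)', 'a, b')
print(runs)  # ['a', ', ', 'b']

# If you want single-character separators, drop the empty strings afterwards.
cleaned = [piece for piece in one_at_a_time if piece]
print(cleaned)  # ['a', ',', ' ', 'b']
```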