I want to split a string using -, +=, ==, =, +, and whitespace as delimiters. I also want to keep the delimiters.
This pattern is more in line with what you want:
\s*(\-|\+\=|\=\=|\=|\+)\s*
You will still get an empty string between each split, though, as you should expect.
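As a quick check of what that pattern returns, here is a minimal sketch run against a sample string. The input string is just an illustration, and the pattern is written as a raw string with the redundant backslashes dropped, which should not change what it matches.

```python
import re

# The pattern from above, as a raw string without the redundant backslashes
# (- and = need no escaping outside a character class).
pattern = re.compile(r"\s*(-|\+=|==|=|\+)\s*")

parts = pattern.split("hello-+==== =+ there")
print(parts)
# ['hello', '-', '', '+=', '', '==', '', '=', '', '=', '', '+', 'there']

# Whitespace on its own is not a delimiter for this pattern.
unsplit = pattern.split("hello there")
print(unsplit)
# ['hello there']
```

Note that this pattern only swallows whitespace around an operator; a run of whitespace with no operator next to it does not cause a split.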
By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use):
['hello', '', '', '', '', '', '', '', 'there']
Note the empty strings in between - and +=, += and ==, etc.
As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the bits that the capture group matches are interspersed:
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
None corresponds to where the \s+ half of your pattern was matched; in those cases, the capture group captured nothing.
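The two lists above can be reproduced side by side with a short sketch. The sample string and the variable names are only illustrative; the two patterns differ solely in whether the operator alternation is a capture group.

```python
import re

s = "hello-+==== =+ there"

# Non-capturing group: only the text between matches is returned.
non_capturing = re.compile(r"(?:-|\+=|==|=|\+)|\s+")
plain = non_capturing.split(s)
print(plain)
# ['hello', '', '', '', '', '', '', '', 'there']

# Capturing group: the captured operator (or None, when \s+ matched)
# is inserted after each piece.
capturing = re.compile(r"(-|\+=|==|=|\+)|\s+")
with_delims = capturing.split(s)
print(with_delims)
# ['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
```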
From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard Nones and empty strings:
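import re

def tokenize(s):
    # Operators are captured so re.split keeps them; whitespace is matched but not captured.
    pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
    # Drop the None entries and empty strings that re.split leaves behind.
    return [x for x in pattern.split(s) if x]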
One last note: For what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options here.
Why is it behaving this way?
According to the documentation for re.split:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
This is literally correct: if capturing parentheses are used, then the text of all groups is returned, whether or not they matched anything; the ones which didn't match anything return None.
As always with split, two consecutive delimiters are considered to separate empty strings, so you get empty strings interspersed.
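Both behaviours are easy to see in isolation. This is a small sketch with made-up patterns and inputs, chosen only to show each effect on its own:

```python
import re

# Two groups, but each match fills only one of them; the other is None.
groups_demo = re.split(r"(a)|(b)", "xaybz")
print(groups_demo)
# ['x', 'a', None, 'y', None, 'b', 'z']

# Two adjacent delimiters separate an empty string.
empties_demo = re.split(r",", "a,,b")
print(empties_demo)
# ['a', '', 'b']
```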
how might I change it to get what I want?
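The simplest solution is to filter the output:
filter(None, pattern.split(s))
In Python 3, filter returns a lazy iterator, so wrap the call in list(...) if you need an actual list.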
Perhaps re.findall would be more suitable for you?
>>> re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', "hello-+==== =+ there")
['hello', '-', '+=', '==', '=', '=', '+', 'there']
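Wrapped in a function, the same idea might look like the sketch below. The names TOKEN and tokenize are only illustrative. The order of the alternatives matters, because the regex engine tries them left to right and takes the first one that matches.

```python
import re

# Longer operators come before their prefixes: "+=" before "+", "==" before "=".
TOKEN = re.compile(r"-|\+=|==|=|\+|[^-+=\s]+")

def tokenize(s):
    """Return the operators and the runs of other non-space characters in s."""
    return TOKEN.findall(s)

tokens = tokenize("hello-+==== =+ there")
print(tokens)
# ['hello', '-', '+=', '==', '=', '=', '+', 'there']

# With the order reversed, "=" wins first and "==" is never produced.
wrong_order = re.findall(r"=|==", "==")
print(wrong_order)
# ['=', '=']
```

Because findall matches the tokens themselves rather than the gaps between them, there are no empty strings or None entries to filter out afterwards.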
Try this:
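import re

def tokenize(s):
    # Operators are captured so they survive the split; whitespace is not.
    pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
    result = []
    for item in pattern.split(s):
        # Skip the empty strings and None entries that re.split produces.
        if item != '' and item is not None:
            result.append(item)
    return result

print(tokenize("hello-+==== =+ there"))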