As an exercise I was trying to come up with a regex to evaluate simple algebra like
q = '23 * 345 - 123+65'
From here I want to get '23', \
Simply try this.
import re
q = '23 * 345 - 123+65'
regexparse = r'(\d+)|[-+*/]'
for i in re.finditer(regexparse, q):
    print(i.group(0))
output:
23
*
345
-
123
+
65
This is your regex:
(\d+\s*(\*|\/|\+|\-)\s*)+(\d+\s*)
The first part, (\d+\s*(\*|\/|\+|\-)\s*), will match the start of your expression, '23 * ', and store '*' in the second group.
Then the + quantifier makes it repeat, but because repeating capture groups retain only their last match, it will discard '23 * ' and '*' and instead capture '345 - ' in the first group and '-' in the second.
The + quantifier works again on the next repeat to discard that capture and instead capture '123+' in the first group and '+' in the second.
Next, the group cannot repeat any more, so it stops, and (\d+\s*) starts matching to get '65'.
The fact that repeating capture groups store only the last capture is how regex works by design, and it is like this in most regex engines AFAIK (.NET, which keeps every capture of a group, is a notable exception).
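To see this for yourself, here is a minimal sketch that runs the regex from the question against the same input and prints what each group holds at the end. The variable names pattern and m are just illustrative.

```python
import re

q = '23 * 345 - 123+65'
pattern = r'(\d+\s*(\*|\/|\+|\-)\s*)+(\d+\s*)'

m = re.match(pattern, q)

# The overall match covers the whole expression...
print(m.group(0))   # 23 * 345 - 123+65

# ...but each group only reports its final capture.
print(m.groups())   # ('123+', '+', '65')
```

The earlier captures ('23 * ' and '345 - ') were matched, but they are no longer accessible through the match object.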
Further elaboration:
There's a difference between matching repeatedly and capturing repeatedly. Try (\d)+ on 12345 and you will see that only 5 is captured. It's like that because each pair of parentheses is assigned one capture group. The first pair is group 1, and if group 1 captures many times, you can only keep one capture, and that's the last. This is how regex works, unfortunately, as per the docs:
If a group matches multiple times, only the last match is accessible
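A quick way to check this, as a small sketch: the repeated group matches all five digits, but group 1 only holds the last one. Matching one digit at a time with re.findall is one way to keep them all.

```python
import re

m = re.match(r'(\d)+', '12345')
print(m.group(0))  # 12345  (the whole match)
print(m.group(1))  # 5      (only the last capture of group 1)

# Matching a single digit repeatedly, instead of repeating a group,
# returns every digit as its own match.
digits = re.findall(r'\d', '12345')
print(digits)      # ['1', '2', '3', '4', '5']
```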
If you want to get your desired output, you might use re.findall and match with \d+|[+/*-]:
import re
q = '23 * 345 - 123+65'
regexparse = r'\d+|[+/*-]'
elem = re.findall(regexparse, q)
print(elem)
#=> ['23', '*', '345', '-', '123', '+', '65']
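Since the original goal was to evaluate the expression, here is one possible sketch of how the token list could be used. The evaluate function and the OPS table are hypothetical names, not part of the re module. It assumes well-formed input of non-negative integers separated by single operators, with no parentheses or unary minus, and applies * and / before + and -.

```python
import operator
import re

# Hypothetical helper table mapping operator tokens to functions.
OPS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,  # note: division yields a float
}

def evaluate(expr):
    """Evaluate a flat expression such as '23 * 345 - 123+65'."""
    tokens = re.findall(r'\d+|[+/*-]', expr)

    # First pass: fold * and / into the running term,
    # and set + and - aside for the second pass.
    values = [int(tokens[0])]
    pending = []
    for op, num in zip(tokens[1::2], tokens[2::2]):
        if op in ('*', '/'):
            values[-1] = OPS[op](values[-1], int(num))
        else:
            pending.append(op)
            values.append(int(num))

    # Second pass: apply + and - from left to right.
    result = values[0]
    for op, val in zip(pending, values[1:]):
        result = OPS[op](result, val)
    return result

print(evaluate('23 * 345 - 123+65'))  # 7877
```

This does no validation, so malformed input such as '23 * * 4' would give a wrong answer or an error rather than a clear message.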
Your regex is confusing. Better to use re.split() for this purpose:
import re
q = '23 * 345 - 123+65'
print(re.split(r'\s*([-+/*])\s*', q))
Outputs:
['23', '*', '345', '-', '123', '+', '65']
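The operators show up in that output because the operator class is wrapped in a capturing group: re.split includes the text of any capturing groups in the result. A small sketch comparing the two forms (the names with_ops and without_ops are just for illustration):

```python
import re

q = '23 * 345 - 123+65'

# Capturing group around the operator: separators are kept in the list.
with_ops = re.split(r'\s*([-+/*])\s*', q)
print(with_ops)     # ['23', '*', '345', '-', '123', '+', '65']

# No capturing group: separators are dropped, only the numbers remain.
without_ops = re.split(r'\s*[-+/*]\s*', q)
print(without_ops)  # ['23', '345', '123', '65']
```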
I can only speak of regex in general, as I don't know Python, but your problem is that in
(\d+\s*[\*/+-]\s*)+(\d+\s*)
this portion
(\d+\s*[\*/+-]\s*)+
is being repeated, and when it's completely done evaluating, you only see the capture from the final repetition.
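If you still want a single pattern to check the overall shape of the expression, one option is to use it only for validation and extract the pieces with a separate, simpler pattern. This is a sketch under that assumption. SHAPE, TOKEN and tokenize are hypothetical names, and re.fullmatch requires Python 3.4 or later.

```python
import re

# Whole-expression shape: a number, then zero or more (operator, number) pairs.
# The repeated group is non-capturing because its captures are not needed.
SHAPE = re.compile(r'\s*\d+\s*(?:[-+*/]\s*\d+\s*)*')

# One token at a time: either a number or a single operator.
TOKEN = re.compile(r'\d+|[-+*/]')

def tokenize(expr):
    """Return the list of tokens, or raise ValueError on malformed input."""
    if SHAPE.fullmatch(expr) is None:
        raise ValueError('not a simple arithmetic expression: %r' % expr)
    return TOKEN.findall(expr)

print(tokenize('23 * 345 - 123+65'))
# ['23', '*', '345', '-', '123', '+', '65']
```

Splitting the job this way sidesteps the repeated-group limitation entirely, because nothing depends on what the repeated group captured.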