I have the code:
import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)
This returns ['aa']
Your pattern
rexp=re.compile("(aa|bb)+")
matches the whole string aabbaa. To clarify, just look at this:
>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(0)
'aabbaa'
There are no other substrings left to match after that, and the group keeps only its last repetition:
>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(1)
'aa'
So findall returns only that one captured substring:
>>> re.findall(re.compile("(aa|bb)+"),"aabbaa")
['aa']
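To make the difference between the whole match and the captured group visible in one place, here is a minimal sketch. The variable names (whole, last, span, whole_matches) are illustrative only; finditer is used here as one possible way to reach the full match text, not as the only fix.

```python
import re

rexp = re.compile("(aa|bb)+")
m = rexp.match("aabbaa")

whole = m.group(0)  # text matched by the entire pattern
last = m.group(1)   # only the last repetition captured by the group
span = m.span()     # (start, end) of the whole match

# finditer yields match objects, so the whole match stays reachable
# even though the pattern contains a capturing group.
whole_matches = [mo.group(0) for mo in rexp.finditer("aabbaa")]

print(whole, last, span, whole_matches)
```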
Let me explain what you are doing:
regex = re.compile("(aa|bb)+")
You are creating a regex which looks for aa or bb, then tries to find more aa or bb right after that, and keeps looking for aa or bb until it doesn't find any. Since your capturing group holds only a single aa or bb at a time, you only get the last captured group.
However, if you have a string like this: aaxaabbxaa, you will get aa, bb, aa. You first look at the string and find aa, then you look for more and find only an x, so you have your first group, aa. Then you find another aa, but then you find a bb, and then an x, so you stop, and your second group is bb. Then you find another aa. And so your final result is aa, bb, aa.
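The walk-through above can be checked with a short sketch. It assumes the same pattern and the example string aaxaabbxaa; the names trace and found are made up for illustration.

```python
import re

rexp = re.compile("(aa|bb)+")
text = "aaxaabbxaa"

# One row per match: (start, end, whole match, last captured group)
trace = [(m.start(), m.end(), m.group(0), m.group(1))
         for m in rexp.finditer(text)]
for row in trace:
    print(row)

# findall reports only the group column of each row
found = rexp.findall(text)
print(found)
```

The second row is the interesting one: the whole match is aabb, but the group ends up holding only bb.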
I hope this explains what your regex is doing, and that the result is as expected. To get every aa or bb on its own, you need to remove the +, which tells the regex to consume repeated groups before returning a match, and just have the regex return each match of aa or bb
...
so your regex should be:
regex = re.compile("(aa|bb)")
cheers.
The unwanted behaviour comes down to the way you formulate the regular expression:
rexp=re.compile("(aa|bb)+")
The parentheses (aa|bb) form a group.
And if we look at the docs of findall we will see this:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
As you formed a group, it matched first aa, then bb, then aa again (because of the + quantifier). So this group holds aa in the end. And findall returns this value in the list ['aa'] (as there is only one match, aabbaa, of the whole expression, the list contains only one element, aa, which is saved in the group).
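The rule quoted from the docs can be seen by varying only the number of groups in the pattern. This is a small sketch; the three variable names are hypothetical and the third pattern, with an extra outer group, is added here purely for illustration.

```python
import re

sequence = "aabbaa"

# No group: findall returns the text of each whole match.
no_group = re.findall("aa|bb", sequence)

# One group: findall returns what that group captured in each match.
one_group = re.findall("(aa|bb)+", sequence)

# Two groups: findall returns one tuple per match, one item per group.
# The outer group holds the whole run, the inner one its last repetition.
two_groups = re.findall("((aa|bb)+)", sequence)

print(no_group, one_group, two_groups)
```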
From the code you gave, you seemed to want to do this:
>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']
(?: ...) doesn't create a group, so findall returns the match of the whole expression.
At the end of your question you show the desired output. This is achieved by just looking for aa or bb. No quantifiers (+ or *) are needed. Just do it the way shown in Inbar Rose's answer:
>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']
I do not understand why you use +. It means one or more repetitions of the preceding group (it is ? that means 0 or 1 occurrence, and is usually used when you want to find a string with an optional substring). Without the quantifier:
>>> re.findall(r'(aa|bb)', 'aabbaa')
['aa', 'bb', 'aa']
This works as expected.
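For reference, here is a rough sketch of what the common quantifiers accept when applied to the same alternation. The helper name accepts is made up for this example, and non-capturing groups with re.fullmatch (Python 3.4+) are used only to keep the comparison simple.

```python
import re

def accepts(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

samples = ("", "aa", "aabb")

# "+" : one or more repetitions
plus = [accepts("(?:aa|bb)+", s) for s in samples]
# "?" : zero or one occurrence
optional = [accepts("(?:aa|bb)?", s) for s in samples]
# "*" : zero or more repetitions
star = [accepts("(?:aa|bb)*", s) for s in samples]

print(plus, optional, star)
```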