I have the following code:
haystack = \"aaa months(3) bbb\"
needle = re.compile(r\'(months|days)\\([\\d]*\\)\')
instances = list(set(needle.findall(haystack)))
p
needle = re.compile(r'((?:months|days)\([\d]*\))')
fixes your problem.
you were capturing only the months|days part.
in this specific situation, this regex is a bit better:
needle = re.compile(r'((?:months|days)\(\d+\))')
this way you will only get results with a number, previously a result like months()
would work. if you want to ignore case for options like Months or Days, then also add the re.IGNORECASE
flag. like this:
re.compile(r'((?:months|days)\(\d+\))', re.IGNORECASE)
some explanation for the OP:
a regular expression is comprised of many elements, the chief among them is the capturing group. "()
" but sometimes we want to make groups without capturing, so we use "(?:)
" there are many other forms of groups, but these are the most common.
in this case, we surround the entire regular expression in a capturing group, because you are trying to capture everything, normally - any regular expression is automatically surrounded by a capturing group, but in this case, you specified one explicitly, so it did not surround your regular expression with an automatic capture group.
now that we have surrounded the entire regular expression with a capturing group, we turn the group we have into a non-capturing group by adding ?:
to the beginning, as shown above. we could also not have surrounded the entire regular expression and only turned the group into a non-capturing group, since as you saw, it will automatically turn the whole regular expression into a capturing group where non is present. i personally prefer explicit coding.
further information about regular expressions can be found here: http://docs.python.org/library/re.html