Code:
import re

text = '<br><br />A<br />B'
print(re.sub(r'<br.*?>\w$', '', text))
It is expected to return <br><br />A, but it returns an empty string instead.
Non-greediness does not make the match start later in the string. The regex matches the first <br and then non-greedily matches the rest, which has to extend all the way to the end of the string because you specified $.
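One way to see this is to look at the span of the match. The sketch below assumes the input string <br><br />A<br />B and the pattern <br.*?>\w$; the variable names are only illustrative.

```python
import re

# Assumed input and pattern; variable names are illustrative.
text = '<br><br />A<br />B'
match = re.search(r'<br.*?>\w$', text)

# The match starts at the first <br (index 0) and, because of $,
# has to run all the way to the end of the string.
print(match.span())             # (0, 18)
print(match.group(0) == text)   # True
```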
To make it work the way you wanted, use /<br[^<]*?>\w$/. Usually, though, it is not recommended to use regex to parse HTML, because an attribute value can contain < or >.
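As a quick check, the following sketch applies the suggested pattern in Python (without the / delimiters, which Python does not use) to the assumed input <br><br />A<br />B. The second input is made up to illustrate the caveat about < inside an attribute value.

```python
import re

text = '<br><br />A<br />B'   # assumed input
pattern = r'<br[^<]*?>\w$'

# [^<] cannot cross into the next tag, so only the last <br />B matches.
result = re.sub(pattern, '', text)
print(result)  # <br><br />A

# Hypothetical input: a < inside an attribute value stops [^<]*? early,
# so the pattern finds nothing and the string comes back unchanged.
tricky = '<br title="a<b">B'
unchanged = re.sub(pattern, '', tricky)
print(unchanged == tricky)  # True
```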
Non-greediness works from left to right, but not the other way around. It basically means "don't match unless you failed to match". Here's what's going on:
1. The engine matches <br at the start of the string.
2. .*? is ignored for now, because it is lazy.
3. The engine tries to match >, and succeeds.
4. It tries to match \w and fails. Now it gets interesting: the engine starts backtracking and sees the .*? rule. In this case, . can match the first >, so there's still hope for a match.
5. This keeps happening until >\w can match, but then $ fails. Again, the engine comes back to the lazy .*? rule and keeps matching, until the match covers the whole string, <br><br />A<br />B.
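The backtracking steps can be imitated by hand. The sketch below is not how the regex engine is implemented; it simply tries every length for the lazy .*? in increasing order and checks whether the rest of the pattern, >\w$, matches from there. The variable names are illustrative.

```python
import re

text = '<br><br />A<br />B'
prefix = '<br'                 # literal part that is matched first
rest = re.compile(r'>\w$')     # what has to follow the lazy .*?

start = len(prefix)
consumed = None
# A lazy quantifier tries the shortest expansion first: 0 chars, then 1, ...
# (every character here can be matched by ., since there are no newlines)
for n in range(len(text) - start + 1):
    if rest.match(text, start + n):
        consumed = text[start:start + n]
        break

print(repr(consumed))   # '><br />A<br /'
print(len(consumed))    # 13

# The real engine agrees: the lazy group swallowed the same text.
print(re.search(r'<br(.*?)>\w$', text).group(1) == consumed)  # True
```

Even though .*? is lazy, it ends up consuming almost the entire string, because no shorter expansion lets >\w$ succeed from a match that starts at index 0.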
Luckily, there's an easy solution: by using <br[^>]*>\w$ as the pattern, you don't allow matching outside of the tag, so it should replace only the last occurrence.
Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
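A short Python check of this pattern, assuming the input string from the walkthrough. The last input is a made-up example of an attribute value containing >, to show the limitation just mentioned.

```python
import re

pattern = r'<br[^>]*>\w$'

text = '<br><br />A<br />B'   # assumed input
result = re.sub(pattern, '', text)
print(result)  # <br><br />A

# Hypothetical input: the > inside the attribute value ends [^>]* too early,
# so nothing matches and the tag is left in place.
tricky = '<br title="a>b">B'
unchanged = re.sub(pattern, '', tricky)
print(unchanged == tricky)  # True
```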