Python re.sub: non-greedy mode (.*?) combined with end of string ($) becomes greedy!

终归单人心 2021-01-12 17:52

Code:

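import re

s = '<br><br />A<br />B'
print(re.sub(r'<br.*?>\w$', '', s))

It is expected to remove only the last <br />B and return <br><br />A, but the whole string is matched and an empty string is returned.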

2 answers
  • 2021-01-12 18:33

    Non-greediness does not make the match start later. The engine matches the first <br and then non-greedily matches the rest, which has to extend to the end of the string because you specified $.

    To make it work the way you wanted, use

    /<br[^<]*?>\w$/
    

    but it is usually not recommended to parse HTML with regex, because an attribute value can contain < or >.
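
    As a rough illustration of the suggested pattern in Python, here is a minimal sketch. It assumes the input is the string <br><br />A<br />B quoted elsewhere in this thread; the variable names are only for illustration.

```python
import re

# Sample input, assumed to be the string discussed in this thread.
s = '<br><br />A<br />B'

# [^<]*? cannot cross into the next tag, so an attempt that starts at an
# earlier <br fails and the engine moves on to a later start position.
result = re.sub(r'<br[^<]*?>\w$', '', s)
print(result)  # <br><br />A
```

    Only the final <br />B is removed, because that is the only <br whose tag is directly followed by a single word character and the end of the string.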

  • 2021-01-12 18:47

    Greediness works from left to right, but not the other way around: a lazy quantifier never moves the start of the match. It basically means "don't match unless you failed to match". Here's what's going on:

    1. The regex engine matches <br at the start of the string.
    2. .*? is skipped for now, because it is lazy.
    3. The engine tries to match > and succeeds.
    4. It tries to match \w and fails. Now it gets interesting: the engine backtracks and sees the .*? rule. In this case, . can match the first >, so there's still hope for that match.
    5. This keeps happening until the regex reaches the slash. Then >\w can match, but $ fails. Again, the engine goes back to the lazy .*? rule and keeps extending it, until it matches the whole string <br><br />A<br />B
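
    The steps above can be checked with a small Python sketch. It uses the string quoted in step 5; the variable names are made up for illustration.

```python
import re

# Input string as quoted in the explanation above.
s = '<br><br />A<br />B'

# Without $, the lazy .*? stops at the first position where >\w can match.
lazy_only = re.search(r'<br.*?>\w', s)
print(lazy_only.group(0))      # <br><br />A

# With $, the engine keeps extending .*? from the same start (index 0)
# until the whole pattern, including $, succeeds.
anchored = re.search(r'<br.*?>\w$', s)
print(anchored.span())         # (0, 18)
print(anchored.group(0) == s)  # True
```

    In both cases the match starts at index 0. Laziness only affects how far the match extends, not where it begins.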

    Luckily, there's an easy solution: by using <br[^>]*>\w$ instead, you don't allow the match to extend outside of a tag, so only the last occurrence is replaced.
    Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
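
    A minimal Python sketch of this fix and of its limitation follows. The first string is the one discussed in this thread; the second string, with a > inside an attribute value, is a made-up example.

```python
import re

pattern = r'<br[^>]*>\w$'

# String discussed in this thread: only the last tag and letter are removed.
s = '<br><br />A<br />B'
fixed = re.sub(pattern, '', s)
print(fixed)  # <br><br />A

# Hypothetical input whose attribute value contains '>'.
# [^>]* stops at that '>', the rest of the pattern fails, and nothing is replaced.
tricky = 'A<br title="a>b">B'
unchanged = re.sub(pattern, '', tricky)
print(unchanged == tricky)  # True
```

    For real HTML, a proper parser is probably the safer choice, as noted above.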
