Why does re.sub in Python not work correctly on this test case?

感动是毒 2021-01-21 06:47

Try this code.

import re

test = ' az z bz z z stuff z  z '
re.sub(r'(\W)(z)(\W)', r'\1_\2\3', test)

This should replace all stand-alone z's w

4 Answers
  • 2021-01-21 07:20

    This happens because the matches you want would have to overlap: each match consumes the non-word character after z, so that character is no longer available as the leading character for the next z. The fix is to stop matching the extra characters. There are two ways to do that: use \b, the word boundary, as others suggest, or use a lookbehind assertion and a lookahead assertion. (Prefer \b where it fits, which it probably does here. The lookaround version is mainly here for educational purposes.)

    >>> re.sub(r'(?<!\w)(z)(?!\w)', r'_\1', test)
    ' az _z bz _z _z stuff _z  _z '
    

    (?<!\w) makes sure there wasn't \w before.

    (?!\w) makes sure there isn't \w after.

    The (?<!...) and (?!...) forms are zero-width assertions, not capturing groups, so (z) is still \1.


    As for a graphical explanation of why it fails:

    The regex is going through the string doing replacement; it's at these three characters:

    ' az _z bz z z stuff z  z '
              ^^^
    

    It does that replacement. The final character of the match (the space after z) has now been consumed, so the search resumes approximately here:

    ' az _z bz _z z stuff z  z '
                  ^^^ <- It starts matching here.
                 ^ <- Not this character, it's been consumed by the last match
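
To see the consumed character directly, here is a minimal sketch that prints the match spans for the question's pattern on the same test string and compares them with the lookaround version. The variable names (consuming, lookaround, spans, starts) are just illustrative.

```python
import re

test = ' az z bz z z stuff z  z '

# The question's pattern: consumes one non-word character on each side of z.
consuming = re.compile(r'(\W)(z)(\W)')

# (start, end) of every match in the original string.
# The match at (8, 11) ends with the space at index 10, so the z at
# index 11 has no unconsumed non-word character in front of it.
spans = [m.span() for m in consuming.finditer(test)]
print(spans)   # [(3, 6), (8, 11), (18, 21), (21, 24)]

# Lookarounds only inspect the neighbours, they do not consume them,
# so every stand-alone z is found.
lookaround = re.compile(r'(?<!\w)z(?!\w)')
starts = [m.start() for m in lookaround.finditer(test)]
print(starts)  # [4, 9, 11, 19, 22]

print(repr(consuming.sub(r'\1_\2\3', test)))  # ' az _z bz _z z stuff _z  _z '
```

The z at index 11 shows up in the second list but falls between two spans in the first, which is exactly the one the original call leaves unchanged.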
    
  • 2021-01-21 07:23

    If your goal is to make sure you only match z when it's a standalone word, use \b to match word boundaries without actually consuming the whitespace:

    >>> re.sub(r'\b(z)\b', r'_\1', test)
    ' az _z bz _z _z stuff _z  _z '
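
A side effect of \b being zero-width is that it also works at the very start and end of the string, where (\W) has no character to match. A small sketch, assuming a made-up input 'z a z' that is not from the question:

```python
import re

# Hypothetical input: z sits at the very start and very end of the string.
edge = 'z a z'

# (\W) needs a real character on both sides, so neither z matches.
consuming = re.sub(r'(\W)(z)(\W)', r'\1_\2\3', edge)

# \b matches a position, including the string edges, so both z's match.
boundary = re.sub(r'\b(z)\b', r'_\1', edge)

print(repr(consuming))  # 'z a z'
print(repr(boundary))   # '_z a _z'
```

The test string in the question hides this because it is padded with spaces on both ends.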
    
  • 2021-01-21 07:32

    Use this:

    test = ' az z bz z z stuff z  z '
    re.sub(r'\b(z)\b', r'_\1', test)
    
  • 2021-01-21 07:37

    You want to avoid consuming the whitespace. Try using the zero-width word boundary \b, like this:

    re.sub(r'\bz\b', '_z', test)
    