Try this code.
import re
test = ' az z bz z z stuff z z '
re.sub(r'(\W)(z)(\W)', r'\1_\2\3', test)
This should replace all stand-alone z's w
This happens because the matches overlap: each match consumes the non-word character after the z, so that character is no longer available to start the next match. You need to avoid consuming the extra characters, and there are two ways to do this. One is \b, the word boundary, as suggested by others; the other is a lookbehind assertion combined with a lookahead assertion. (Use \b if it fits your case, as it probably does. The lookaround version is here mainly for educational purposes.)
>>> re.sub(r'(?<!\w)(z)(?!\w)', r'_\1', test)
' az _z bz _z _z stuff _z _z '
(?<!\w) makes sure there wasn't a \w before.
(?!\w) makes sure there isn't a \w after.
The special (?...) syntax means they aren't capturing groups, so the (z) is \1.
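As a quick check of the point above, here is a minimal sketch (the variable names are only illustrative) that inspects the compiled pattern. It suggests that the lookbehind and lookahead add no capturing groups and consume no characters, so the match covers the z alone.

```python
import re

test = ' az z bz z z stuff z z '
pattern = re.compile(r'(?<!\w)(z)(?!\w)')

# Only (z) is a capturing group; the two lookarounds are not counted.
group_count = pattern.groups

# The first match covers just the z, not the spaces around it.
first = pattern.search(test)
first_span = first.span()
first_group = first.group(1)

print(group_count)  # 1
print(first_span)   # (4, 5)
print(first_group)  # z
```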
As for a graphical explanation of why it fails:
The regex is going through the string doing replacement; it's at these three characters:
' az _z bz z z stuff z z '
          ^^^
It does that replacement. The final character has been acted upon, so its next step is approximately this:
' az _z bz _z z stuff z z '
              ^^^ <- It starts matching here.
             ^ <- Not this character, it's been consumed by the last match
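The same picture can be reproduced in code. This is a small sketch (variable names are only illustrative) that lists the spans of each match: with the original pattern every match is three characters wide and swallows the surrounding whitespace, while with \b each match is the z alone.

```python
import re

test = ' az z bz z z stuff z z '

# The original pattern consumes the non-word character on both sides,
# so two z's that share a single space cannot both be matched.
consuming = [m.span() for m in re.finditer(r'(\W)(z)(\W)', test)]

# \b is zero-width, so each match covers only the z itself.
zero_width = [m.span() for m in re.finditer(r'\bz\b', test)]

print(consuming)   # [(3, 6), (8, 11), (18, 21)]
print(zero_width)  # [(4, 5), (9, 10), (11, 12), (19, 20), (21, 22)]
```

The z's at positions 11 and 21 are missing from the first list because the space before each of them was already consumed by the previous match.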
If your goal is to make sure you only match z when it's a standalone word, use \b to match word boundaries without actually consuming the whitespace:
>>> re.sub(r'\b(z)\b', r'_\1', test)
' az _z bz _z _z stuff _z _z '
Use this:
test = ' az z bz z z stuff z z '
re.sub(r'\b(z)\b', r'_\1', test)
You want to avoid consuming the whitespace. Try using the zero-width word boundary \b, like this:
re.sub(r'\bz\b', '_z', test)
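For this input, the suggestions above should all agree. A minimal sketch to confirm that (the dictionary keys are just labels chosen here, not anything from the re module):

```python
import re

test = ' az z bz z z stuff z z '
expected = ' az _z bz _z _z stuff _z _z '

# Run each suggested pattern against the same test string.
results = {
    'lookaround': re.sub(r'(?<!\w)(z)(?!\w)', r'_\1', test),
    'boundary_group': re.sub(r'\b(z)\b', r'_\1', test),
    'boundary_plain': re.sub(r'\bz\b', '_z', test),
}

for name, result in results.items():
    print(name, result == expected)
```

The capturing group in the second variant is optional here; since the matched text is always the literal z, writing '_z' directly in the replacement gives the same result.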