Question
The following code:
>>> import re
>>> text = "imagine a new *world*, a *magic* world"
>>> pattern = re.compile(r'\*(.*?)\*')
>>> pattern.sub(r"<b>\1<\b>", text)
outputs:
imagine a new <b>world<\x08>, a <b>magic<\x08> world
I have two questions:
1.) Why does the backreference '\1' also change the "magic" part of the text? I have read that '\1' refers to the first captured group.
2.) Why is <\b> output as <\x08> even though I used the 'r' prefix? This doesn't happen with '\n'.
Answer 1:
sub replaces all matches, not just the first one. From the documentation:
"Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. [...] The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced."
\b is an escape sequence (for backspace). You should escape it with another \: r'<b>\1<\\b>'
Used as:
In [4]: pattern.sub(r'<b>\1<\\b>', text)
Out[4]: 'imagine a new <b>world<\\b>, a <b>magic<\\b> world'
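As a minimal sketch of the count behaviour quoted from the documentation, the snippet below reuses the question's text and pattern. The value count=1 is chosen only for illustration, and repr is used so that the backslash stays visible in the output.

```python
import re

text = "imagine a new *world*, a *magic* world"
pattern = re.compile(r'\*(.*?)\*')

# Default (count omitted): every non-overlapping match is replaced.
all_replaced = pattern.sub(r'<b>\1<\\b>', text)

# count=1: only the leftmost match is replaced, the second one is left alone.
first_only = pattern.sub(r'<b>\1<\\b>', text, count=1)

print(repr(all_replaced))
print(repr(first_only))
```

With count=1 the second starred word keeps its asterisks, which suggests that the "magic" part changes in the original code simply because sub visits every match.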
Escape sequences are interpreted at two different moments:
- By the Python compiler, when creating the bytecode, which has to decide which characters to put into the strings.
- By the re engine, when performing the substitution.
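The two stages can be observed separately. This is a small sketch rather than anything from the original answer: the variable names and the one-character pattern 'x' are made up for the demonstration.

```python
import re

# Stage 1: the Python compiler decides which characters end up in the string.
plain = '<\b>'   # \b becomes one backspace character (3 characters total)
raw = r'<\b>'    # raw literal keeps the backslash and the b (4 characters total)
print(len(plain), repr(plain))  # 3 '<\x08>'
print(len(raw), repr(raw))      # 4 '<\\b>'

# Stage 2: re processes escapes in the replacement template,
# so the two-character \b from the raw string still turns into a backspace.
substituted = re.sub('x', raw, 'x')
print(repr(substituted))        # '<\x08>'
```

The raw prefix only switches off the first stage, which is why the r'...' replacement in the question still produces \x08.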
To understand why double escaping is required you can try to add one backslash at a time:
In [18]: print(pattern.sub('<b>\\1<\b>', text))
imagine a new <b>world>, a <b>magic> world
In [19]: print(pattern.sub('<b>\\1<\\b>', text))
imagine a new <b>world>, a <b>magic> world
In [20]: print(pattern.sub('<b>\\1<\\\b>', text))
imagine a new <b>world<>, a <b>magic<> world
In [21]: print(pattern.sub('<b>\\1<\\\\b>', text))
imagine a new <b>world<\b>, a <b>magic<\b> world
In [18], the \b is interpreted by the Python compiler, so a real backspace character is put in the string (and, as you can see, when the result is printed it erases the preceding < character).
In [19], the \\ is interpreted by the Python compiler as one escaped \, but afterwards the re engine sees that the replacement text contains an escape sequence and interprets it again, thus yielding the same result as [18].
In [20], the \\ is interpreted as one escaped \ and the final \b as a backspace. The result is that, when printed, the backspace erases the \.
In [21], the four backslashes (\\\\) are interpreted by the Python compiler as two escaped backslashes, which the re engine then interprets as a single \ followed by a b (the expected result). Using four \ is equivalent to using a raw string literal plus one level of escaping.
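The equivalence between four backslashes and a raw string with one extra backslash can be checked directly. The sketch below also shows an alternative that is not part of the original answer: passing a function as the replacement, in which case re inserts the return value as-is and does no template escape processing. The helper name wrap is hypothetical.

```python
import re

text = "imagine a new *world*, a *magic* world"
pattern = re.compile(r'\*(.*?)\*')

# Both literals produce the same characters: <b>\1<\\b>
four_backslashes = '<b>\\1<\\\\b>'
raw_one_escape = r'<b>\1<\\b>'
same_template = four_backslashes == raw_one_escape
print(same_template)  # True

def wrap(match):
    # Hypothetical helper: the returned string is used literally,
    # so a single backslash survives without extra escaping.
    return '<b>' + match.group(1) + r'<\b>'

via_template = pattern.sub(raw_one_escape, text)
via_function = pattern.sub(wrap, text)
print(repr(via_template))
print(repr(via_function))
```

Both calls yield the same string; the function form trades the compact \1 syntax for one fewer layer of escaping.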
Answer 2:
I went to IRC, and people there told me that every time the pattern matches again, the group is captured anew, so the backreference is overwritten with the text captured by that match.
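One way to see this, as a sketch reusing the question's text and pattern: finditer yields one match object per occurrence, and each has its own group 1. The square-bracket replacement is only an illustration.

```python
import re

text = "imagine a new *world*, a *magic* world"
pattern = re.compile(r'\*(.*?)\*')

# Each match carries its own capture for group 1.
captures = [m.group(1) for m in pattern.finditer(text)]
print(captures)  # ['world', 'magic']

# In sub, \1 is resolved separately for each match.
bracketed = pattern.sub(r'[\1]', text)
print(bracketed)  # imagine a new [world], a [magic] world
```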
Source: https://stackoverflow.com/questions/24094417/what-group-does-backreference-refers-to-when-used-with-sub-operation