What group does a backreference refer to when used with the sub() operation?

Submitted by  ̄綄美尐妖づ on 2019-12-13 04:01:24

Question


The following code:

>>> import re
>>> text = "imagine a new *world*, a *magic* world"
>>> pattern = re.compile(r'\*(.*?)\*')
>>> pattern.sub(r"<b>\1<\b>", text)

outputs:

'imagine a new <b>world<\x08>, a <b>magic<\x08> world'

I have two questions:

1.) Why does the backreference '\1' also change the "magic" part of the text? I have read that '\1' refers to the first captured group.

2.) Why does <\b> come out as <\x08> even though I used the 'r' prefix? It doesn't happen with '\n'.


Answer 1:


  1. sub replaces all matches, not just the first one. From the documentation:

    Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. [...] The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced.

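  2. \b is an escape sequence (for backspace). The r prefix only stops the Python compiler from interpreting it; sub still processes backslash escapes in the replacement string. You should escape it with another \: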

    r'<b>\1<\\b>'
    

    Used as:

    In [4]: pattern.sub(r'<b>\1<\\b>', text)
    Out[4]: 'imagine a new <b>world<\\b>, a <b>magic<\\b> world'
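
To make point 1 concrete, here is a minimal sketch of how \1 and count interact. It reuses the text and pattern from the question; the replacement uses the HTML closing tag </b>, which is an assumption on my part and is chosen only to keep backslash escaping out of this example.

```python
import re

text = "imagine a new *world*, a *magic* world"
pattern = re.compile(r'\*(.*?)\*')

# \1 is filled in separately for every match, so each match
# contributes its own group 1 to the result.
all_replaced = pattern.sub(r'<b>\1</b>', text)
print(all_replaced)   # imagine a new <b>world</b>, a <b>magic</b> world

# count=1 limits the substitution to the leftmost match only.
first_only = pattern.sub(r'<b>\1</b>', text, count=1)
print(first_only)     # imagine a new <b>world</b>, a *magic* world
```

If only the first occurrence should change, count=1 is probably the simplest way to get that.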
    

Escape sequences are interpreted at two different moments:

  • By the Python compiler, when creating the bytecode, which has to decide which characters to put into the strings.
  • By the re engine, when performing the substitution.
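
The two stages can be observed separately. This is a small sketch, not part of the original session; the names plain, raw and result are made up for illustration.

```python
import re

# Stage 1: the Python compiler decides which characters the literal contains.
plain = '<\b>'    # \b becomes a single backspace character, chr(8)
raw = r'<\b>'     # raw string: the backslash and the 'b' are both kept
print(len(plain), len(raw))   # 3 4

# Stage 2: re.sub() processes backslash escapes in the replacement
# template, so even the raw string's \b ends up as a backspace.
result = re.sub('x', raw, 'x')
print(repr(result))   # '<\x08>'
```

Looking at repr() rather than the printed text helps here, because a terminal acts on the backspace instead of displaying it.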

To understand why double escaping is required you can try to add one backslash at a time:

In [18]: print(pattern.sub('<b>\\1<\b>', text))
imagine a new <b>world>, a <b>magic> world

In [19]: print(pattern.sub('<b>\\1<\\b>', text))
imagine a new <b>world>, a <b>magic> world

In [20]: print(pattern.sub('<b>\\1<\\\b>', text))
imagine a new <b>world<>, a <b>magic<> world

In [21]: print(pattern.sub('<b>\\1<\\\\b>', text))
imagine a new <b>world<\b>, a <b>magic<\b> world

In [18] the \b is interpreted by the Python compiler, so a real backspace character is put into the string (and, as you can see, when the result is printed the backspace erases the preceding < character).

In [19] the \\ is interpreted by the compiler as one escaped \, but afterwards the re engine sees that the replacement text contains an escape sequence (\b) and interprets it itself, thus yielding the same result as [18].

In [20] the \\ is interpreted as one escaped \ and the final \b as a backspace. The result is that, when printed, the backspace erases the \.

In [21] the four backslashes \\\\ are interpreted by the compiler as two escaped backslashes, which the re engine in turn interprets as a single literal \ followed by a b (the expected result). Using four \ is equivalent to using a raw string literal plus one level of escaping, i.e. r'<\\b>'.
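
The same four replacements can be compared without the terminal acting on the backspace by printing repr() of each result. A minimal sketch, reusing text and pattern from the question (the list names are mine):

```python
import re

text = "imagine a new *world*, a *magic* world"
pattern = re.compile(r'\*(.*?)\*')

# The four replacement strings from In [18] to In [21], in order.
replacements = ['<b>\\1<\b>', '<b>\\1<\\b>', '<b>\\1<\\\b>', '<b>\\1<\\\\b>']
results = [pattern.sub(repl, text) for repl in replacements]

# repr() shows the backspace as \x08 instead of letting it erase a character.
for repl, res in zip(replacements, results):
    print(repr(repl), '->', repr(res))
```

The first two results should be identical and contain \x08, while only the last one contains a literal backslash followed by b.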




Answer 2:


I asked on IRC, and people there told me that each time the group captures again on a repeated match, the backreference is overwritten.
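
One way to see this per-match behaviour is to look at the match objects directly. A rough sketch follows; the helper name bold and the upper-casing are hypothetical and only there to make each match's group 1 visible.

```python
import re

text = "imagine a new *world*, a *magic* world"
pattern = re.compile(r'\*(.*?)\*')

# Each match object carries its own value for group 1.
captured = [m.group(1) for m in pattern.finditer(text)]
print(captured)   # ['world', 'magic']

# A callable replacement is called once per match with that match object.
def bold(match):
    return '<b>' + match.group(1).upper() + '</b>'

replaced = pattern.sub(bold, text)
print(replaced)   # imagine a new <b>WORLD</b>, a <b>MAGIC</b> world
```

So \1 in a replacement template behaves like match.group(1) evaluated for whichever match is currently being replaced.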



Source: https://stackoverflow.com/questions/24094417/what-group-does-backreference-refers-to-when-used-with-sub-operation
