Vim / sed regex backreference in search pattern

Posted by 房东的猫 on 2019-12-07 13:30:46

Question


Vim help says that:

\1      Matches the same string that was matched by     */\1* *E65*
        the first sub-expression in \( and \). {not in Vi}
        Example: "\([a-z]\).\1" matches "ata", "ehe", "tot", etc. 

It looks like the backreference can be used in search pattern. I started playing with it and I noticed behavior that I can't explain. This is my file:

<paper-input label="Input label"> Some text </paper-input>
<paper-input label="Input label"> Some text </paper-inputa>
<aza> Some text </az>
<az> Some text </az>
<az> Some text </aza>

I want to match only the lines whose opening and closing tags match, i.e.:

<paper-input label="Input label"> Some text </paper-input>
<az> Some text </az>

And my test regex is:

%s,<\([^ >]\+\).*<\/\1>,,gn

But this matches lines 1, 3 and 4. The same happens with sed:

$ sed -ne 's,<\([^ >]\+\).*<\/\1>,\0,p' file
<paper-input label="Input label"> Some text </paper-input>
<aza> Some text </az>
<az> Some text </az>

The group <\([^ >]\+\) should be greedy, and when I match it without the \1 at the end, all the groups are captured as expected. But when I add \1, it seems that <\([^ >]\+\) stops being greedy and the engine forces a match on the 3rd line. Can someone explain why it matches the 3rd line:

<aza> Some text </az>

This is also a regex101 demo

NOTE: This is not about the regex itself (there is probably another way to do it) but about the behavior of that regex.


Answer 1:


To understand why your regex behaves the way it does, you need to understand what a backtracking regex engine does.

The engine matches greedily, consuming as many characters as it can. But if the rest of the pattern then fails to match, it goes back and tries a different way to match that still satisfies the pattern.

%s,<\([^ >]\+\).*<\/\1>,,gn

For line three, <aza> Some text </az>:

The regex engine first tries \1 = aza and checks whether .*</aza> matches the rest of the string. It doesn't, so the engine backtracks and picks something else for \1. Next it tries \1 = az and checks whether .*</az> matches the rest of the string (a> Some text </az>), and it does. So the line matches.

(This is a simplified version. I skipped over the fact that .* can potentially do a lot of backtracking itself)
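
The backtracking described above can be observed with sed by printing what \1 ended up holding. This is a minimal sketch that assumes GNU sed (for \+ in a basic regular expression); the bracketed output and the variable name captured are only for illustration.

```shell
# Assumes GNU sed: \+ is a GNU extension in basic regular expressions.
# Replace the whole match with the text captured by \1, wrapped in brackets.
captured=$(printf '%s\n' '<aza> Some text </az>' |
  sed -n 's,<\([^ >]\+\).*</\1>,[\1],p')

# The group first grabs "aza", finds no "</aza>", then shrinks to "az".
echo "$captured"   # prints [az]
```

The group is still greedy; it only gives characters back because the longer capture leaves the backreference with nothing to match.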


Solving it is as easy as adding an anchor to the regex that stops the engine from trying other values for \1. In this case, matching a space or > right after the group is sufficient.
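
A sketch of that fix in sed, assuming GNU sed (for \+ in a basic regular expression): requiring a space or > right after the group means the captured tag name can no longer be shortened during backtracking. The sample lines are the five from the question; the variable names are illustrative.

```shell
# The five sample lines from the question.
sample='<paper-input label="Input label"> Some text </paper-input>
<paper-input label="Input label"> Some text </paper-inputa>
<aza> Some text </az>
<az> Some text </az>
<az> Some text </aza>'

# [ >] right after the group pins the end of the opening tag name,
# so \1 cannot backtrack from "aza" to "az".
# Assumes GNU sed (\+ in a basic regular expression).
matched=$(printf '%s\n' "$sample" | sed -n '/<\([^ >]\+\)[ >].*<\/\1>/p')

# Only lines 1 and 4 of the sample are printed.
echo "$matched"
```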




Answer 2:


You need to add \> to indicate end of word. There may be other solutions with zero-width patterns, but they would complicate things.

Also, your separator is , and not /, so the / does not need to be escaped.

Which gives:

%s,<\([^ >]\+\)\>.*</\1>,,gn
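
The same pattern can be tried outside Vim, assuming GNU sed, which also accepts \> as an end-of-word anchor. The sketch below also shows one caveat: - is not a word character, so \> still lets a hyphenated tag name be cut at the hyphen. The second input line and the variable names are made up for illustration.

```shell
# Assumes GNU sed: \+ and \> are GNU extensions in basic regular expressions.

# Line 3 of the question no longer matches: after "az" comes "a",
# so there is no end of word at that position.
line3=$(printf '%s\n' '<aza> Some text </az>' |
  sed -n '\,<\([^ >]\+\)\>.*</\1>,p')

# Hypothetical extra line: "paper" ends a word at the hyphen, so \1 = "paper"
# finds "</paper>" and this mismatched pair is still reported as a match.
hyphen=$(printf '%s\n' '<paper-input> Some text </paper>' |
  sed -n '\,<\([^ >]\+\)\>.*</\1>,p')

echo "line 3 match: '$line3'"
echo "hyphen match: '$hyphen'"
```

If tag names may contain hyphens, matching a literal space or > after the group, as in [ >], avoids this.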



Answer 3:


Currently the reason why line 3 (<aza>) is showing up as a match is that the .* term in your regex can match across multiple lines. So line 3 matches because line 5 has the closing tag. To correct this, force the regex to find a matching closing tag on the same line only:

%s,<\([^ >]\+\)[^\n]*<\/\1>,,gn
               ^^^^^^ use [^\n]* instead of .*


Source: https://stackoverflow.com/questions/39380964/vim-sed-regex-backreference-in-search-pattern
