regex, problem with backreference in pattern with preg_match_all

折月煮酒 submitted on 2019-12-12 19:12:23

Question


I wonder what the problem is with the backreference here:

preg_match_all('/__\((\'|")([^\1]+)\1/', "__('match this') . 'not this'", $matches);

It is expected to match the string between __(' and '), but it actually returns:

match this') . 'not this

Any ideas?


Answer 1:


Make your regex ungreedy:

preg_match_all('/__\((\'|")([^\1]+)\1/U', "__('match this') . 'not this'", $matches);
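
As a rough check of this suggestion, the sketch below runs the question's pattern with the U modifier added. It assumes the opening parenthesis is escaped as in the question. Note that it is the lazy quantifier that stops the match at the first closing quote here; the [^\1] class is still not acting as a backreference.

```php
<?php
// Sketch only: the question's pattern with the U (ungreedy) modifier,
// which makes every quantifier lazy by default.
$subject = "__('match this') . 'not this'";
preg_match_all('/__\((\'|")([^\1]+)\1/U', $subject, $matches);

// $matches[2] holds what the second capturing group matched.
$inner = $matches[2];
var_export($inner);
```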



Answer 2:


You can't use a backreference inside a character class because a character class matches exactly one character, and a backreference can potentially match any number of characters, or none.

What you're trying to do requires a negative lookahead, not a negated character class:

preg_match_all('/__\(([\'"])(?:(?!\1).)+\1\)/',
    "__('match this') . 'not this'", $matches);

I also changed your alternation - \'|" - to a character class - [\'"] - because it's much more efficient, and I escaped the outer parentheses to make them match literal parentheses.
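
To pull out just the quoted text, one option is to wrap the lookahead loop in its own capturing group. This is a minimal sketch, not part of the original answer: the helper name extractQuoted and the extra group are assumptions added for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text
// inside each __('...') or __("...") call found in $subject.
function extractQuoted(string $subject): array
{
    // Group 1 is the opening quote. Group 2 collects characters one at a
    // time, each only if the negative lookahead confirms that the quote
    // captured in group 1 does not start at that position.
    $pattern = '/__\(([\'"])((?:(?!\1).)+)\1\)/';
    preg_match_all($pattern, $subject, $matches);
    return $matches[2];
}

$single = extractQuoted("__('match this') . 'not this'");
$mixed  = extractQuoted('__("it\'s here")');
var_export($single);
var_export($mixed);
```

Because the lookahead tests the captured quote itself, the other kind of quote may appear freely inside the string, as the second call shows.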


EDIT: I guess I need to expand that "more efficient" remark. I took the example Friedl used to demonstrate this point and tested it in RegexBuddy.

Applied to target text abababdedfg,
^[a-g]+$ reports success after three steps, while
^(?:a|b|c|d|e|f|g)+$ takes 55 steps.

And that's for a successful match. When I try it on abababdedfz,
^[a-g]+$ reports failure after 21 steps;
^(?:a|b|c|d|e|f|g)+$ takes 99 steps.

In this particular case the impact on performance is so trivial it's not even worth mentioning. I'm just saying whenever you find yourself choosing between a character class and an alternation that both match the same things, you should almost always go with the character class. Just a rule of thumb.




Answer 3:


I'm surprised it didn't give you an unbalanced parenthesis error message.

 /
   __
   (
       (\'|")
       ([^\1]+)
       \1
 /

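This [^\1] will not take the contents of capture group 1 and put them into a character
class. Inside a character class, \1 is read as an octal escape, so the class means any
character other than the one with code 1, which in practice is almost every character.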
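
A quick way to see that [^\1] is not a backreference is to test it directly. The sketch below assumes PCRE's documented behavior of reading \1 inside a character class as an octal escape for the character with code 1; the variable names are made up for illustration.

```php
<?php
// Inside a character class, \1 is parsed as an octal escape (character
// code 1), so the digit "1" is still accepted by [^\1].
$digitAllowed = preg_match('/^[^\1]$/', '1');

// The control character with code 1 is the only thing the class rejects.
$ctrlAllowed = preg_match('/^[^\1]$/', "\x01");

// Even with group 1 holding a quote, [^\1] still matches another quote.
$quoteAllowed = preg_match('/^(\')[^\1]$/', "''");

var_dump($digitAllowed, $ctrlAllowed, $quoteAllowed);
```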

Try this:

/__\(('|").*?\1\).*/

You can add an inner capturing group to capture just what's between the quotes:
/__\(('|")(.*?)\1\).*/

Edit: If no inner delimiter is allowed, use Qtax's regex.
Even though it is non-greedy, ('|").*?\1 followed by \) will still match everything up to the last quote that precedes a closing parenthesis. In this case it matches all of __('all'this'will"match'), so it's better to use ('[^']*'|"[^"]*").




Answer 4:


You can use something like: /__\(("[^"]+"|'[^']+')\)/
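
The difference between this pattern and a lazy backreference pattern can be sketched as follows. The variable names and the test strings are illustrative assumptions. Since the capturing group here includes the surrounding quotes, the sketch strips them with substr.

```php
<?php
// Lazy version with a backreference, and the stricter alternation of
// two quoted-string forms suggested in this answer.
$lazy   = '/__\((\'|")(.*?)\1\)/';
$strict = '/__\(("[^"]+"|\'[^\']+\')\)/';

$tricky = "__('all'this'will\"match')";
$plain  = "__('match this') . 'not this'";

// The lazy pattern keeps expanding until a quote followed by ")" appears,
// so it accepts the string with embedded quotes.
preg_match($lazy, $tricky, $lazyHit);

// The strict pattern refuses it, because no quote may appear inside.
$strictCount = preg_match($strict, $tricky);

// On ordinary input the strict pattern matches; group 1 still includes
// the quotes, so drop the first and last character.
preg_match($strict, $plain, $plainHit);
$inner = substr($plainHit[1], 1, -1);

var_dump($lazyHit[2], $strictCount, $inner);
```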



Source: https://stackoverflow.com/questions/6050427/regex-problem-with-backreference-in-pattern-with-preg-match-all
