Regex PHP. Reduce steps: limited by fixed width Lookbehind

若如初见. Submitted on 2019-12-08 02:38:18

Question


I have a regex that will be used to match @users tags.

I use lookaround assertions so that a tag may be surrounded by punctuation and whitespace characters.
There is an added complication: the text contains a kind of bbcode that represents HTML.
There are two types of bbcode: inline (^B bold ^b) and block (^C center ^c).
The inline ones have to be skipped over to reach the previous or next character, while the block ones are allowed to surround a tag, just like punctuation.

I made a regex that works. What I want to do now is lower the number of steps it takes at every character that is not going to be a match.
At first I thought I could write a regex that just looks for @ and, once it is found, checks the lookarounds. That worked without the inline bbcodes, but since a lookbehind cannot contain a quantifier, it is harder: I cannot put ((\^[BIUbiu])++)* inside it, so the pattern takes many more steps.
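The limitation can be checked directly from PHP. The following is a minimal sketch, assuming two hypothetical probe patterns written only for this illustration: one puts an unbounded quantifier inside the lookbehind, the other uses only fixed-width alternatives.

```php
<?php
// Hypothetical probe: an unbounded quantifier inside a lookbehind.
$unbounded = '~(?<=(?:\^[BIUbiu])*)@~';
// Fixed-width alternatives (1 and 2 characters long) are accepted.
$fixed = '~(?<=[,.:=^ ]|\^[CJLcjl])@~';

// The @ operator hides the compile warning; preg_match() returns false
// when the pattern itself is rejected.
$unboundedResult = @preg_match($unbounded, ' ^B@alice ');
$fixedResult = preg_match($fixed, ' @alice ');

var_dump($unboundedResult); // false: the pattern does not compile
var_dump($fixedResult);     // int(1): the @ is preceded by a space
```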

How can I make my regex more efficient, with fewer steps?

Here is a simplified version of it; the full regex is in the Regex101 link.

(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*@([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])

https://regex101.com/r/lTPUOf/4/
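For reference, the simplified pattern can be run from PHP roughly as follows. This is a minimal sketch: the `~` delimiters, the sample text, and the variable names are assumptions made for illustration and are not part of the original regex.

```php
<?php
// Simplified pattern from above, wrapped in ~ delimiters (an assumption).
$pattern = '~(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*@([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])~';

// Hypothetical sample: an inline-wrapped tag, a block-wrapped tag,
// and an e-mail-like string that must not count as a tag.
$text = 'Hi ^B@alice^b, meet ^C@bob^c and mail@carol.';

preg_match_all($pattern, $text, $m);

print_r($m[0]); // full matches, including any inline bbcodes
print_r($m[3]); // group 3 holds the user names
```

With this sample, `$m[3]` contains `alice` and `bob`. `mail@carol` is rejected because the character before `@` is neither punctuation nor a block bbcode.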


Answer 1:


A rule of thumb:

Do not let engine make an attempt on matching each single one character if there are some boundaries.

The quote originally comes from this answer. The following regular expression reduces the number of steps significantly, from ~20000 to ~900, thanks to the left side of the outermost alternation:

(?:[^@^]++|[@^]{2,}+)(*SKIP)(*F)
|
(?<=([HUGE-CHARACTER-CLASS])|\^[cjleqrd])
    (\^[34biu78])*+@([a-z\d][\w.-]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))

Actually, I don't care much about the step count reported by regex101, because it won't be the same in your own environment, and it isn't obvious which steps are real and which are missed. But in this case the logic of the regex is clear and the difference is large, so the comparison makes sense.

What is the logic?

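We first try to match what is probably not wanted at all, throw it away, and then look at the parts that may match our pattern. [^@^]++ consumes everything up to the next @ or ^ (the characters a match can start with), and [@^]{2,}+ keeps the engine from taking extra steps on runs of those characters before finding out it is going nowhere. (*SKIP)(*F) then fails the attempt and makes the engine resume after the consumed text instead of retrying from each skipped character. So we make it fail as soon as possible.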
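The idea can be tried in PHP with a sketch like the one below. It assumes the small punctuation class from the question, `[,.:=^ ]` (with the space written as `\x20`), in place of the huge character class, and the shorter bbcode lists `[cjl]` and `[biu]`. The sample text and the helper name `findMentions` are hypothetical. The `x` flag lets the pattern span several lines, and `i` makes it case-insensitive.

```php
<?php
// Sketch only: the punctuation class and bbcode letters come from the
// simplified regex in the question, not from the full pattern.
function findMentions(string $text): array
{
    // Branch 1 consumes text that cannot start a match, then (*SKIP)(*F)
    // fails the attempt and restarts matching after the consumed text.
    // Branch 2 is the real pattern; group 3 holds the user name.
    $pattern = '~
        (?:[^@^]++|[@^]{2,}+)(*SKIP)(*F)
        |
        (?<=([,.:=^\x20])|\^[cjl])
        (\^[biu])*+@([a-z\d][\w.-]{0,25}[a-z\d])(\^[biu])*+(?=(?1))
    ~xi';

    preg_match_all($pattern, $text, $matches);
    return $matches[3];
}

print_r(findMentions('Hi ^B@alice^b, meet ^C@bob^c and mail@carol.'));
```

On this sample the function returns `alice` and `bob`. It makes no claim about step counts, which are best compared in a debugger such as regex101 with the real pattern and real input.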

You can use the i flag instead of listing the uppercase forms of the letters (this may have a small impact on performance, however).

See live demo here



Source: https://stackoverflow.com/questions/54074550/regex-php-reduce-steps-limited-by-fixed-width-lookbehind
