Why is a character class faster than alternation?

旧街凉风 提交于 2019-11-26 09:54:35

问题


It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes I verified it (~10 times slower).
Also using (?:a|b|c) in case the capturing parenthesis makes a difference does not change the result.
But I can not understand why. I think it is because of backtracking but the way I see it at each position there are 3 character comparison so I am not sure how backtracking hits in affecting the alternation. Is it a result of the implementation\'s nature of alternation?


回答1:


This is because the "OR" construct | backtracks between the alternation: If the first alternation is not matched, the engine has to return before the pointer location moved during the match of the alternation, to continue matching the next alternation; Whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:

Pattern: (r|f)at
Match string: carat

Pattern: [rf]at
Match string: carat


But to be short, the fact that pcre engine optimizes this (single literal characters -> character class) away is already a decent hint that alternations are inefficient.




回答2:


Because a character class like [abc] is irreducable and can be optimised, whereas an alternation like (?:a|b|c) may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).

The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.

There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".



来源:https://stackoverflow.com/questions/22132450/why-is-a-character-class-faster-than-alternation

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!