PHP 7 preg_replace PREG_JIT_STACKLIMIT_ERROR with simple string

Submitted by 与世无争的帅哥 on 2020-01-11 13:16:07

Question


I know other people have asked questions about this error, but I can't see how this regex or the subject string could be any simpler.

To me this is a bug, but before submitting it to PHP I thought I'd make sure and get help to see if this can be simpler.

Here's a small test script showing 2 strings; one with 1024 x's and one with 1023:

// 1024 x's
$str = '_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'; 

// Outputs nothing (bug?)
echo preg_replace('/(?<=[^\w]|^)_([^_\n\t ](.|\n(?!\n))*?)_(?=[^\w]|$)/', '[i]${1}[/i]', $str); 

echo "\n\n";

// 1023 x's
$str = '_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'; 

// Outputs the unchanged string as expected
echo preg_replace('/(?<=[^\w]|^)_([^_\n\t ](.|\n(?!\n))*?)_(?=[^\w]|$)/', '[i]${1}[/i]', $str);

As you can see, only with a slightly longer string (greater than 1024 characters) do we get an error. The strings that will be processed by this are going to be any length – they will be forum posts, news articles, etc.
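The empty output above is what a failed preg_replace() looks like: the function returns null and the reason has to be read from preg_last_error(). The following is a minimal sketch of making that failure visible; replace_or_report() is a hypothetical helper name, not something from the original post, and whether the 1024-character subject actually fails depends on the PHP build and its JIT settings.

```php
<?php
// Hypothetical helper (not part of the original post): run preg_replace()
// and raise an exception instead of silently returning null.
function replace_or_report(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_replace() returns null on failure; preg_last_error() says why.
        $code = preg_last_error();
        $name = $code === PREG_JIT_STACKLIMIT_ERROR
            ? 'PREG_JIT_STACKLIMIT_ERROR'
            : "PCRE error code $code";
        throw new RuntimeException($name);
    }
    return $result;
}

// The pattern and replacement from the question.
$pattern = '/(?<=[^\w]|^)_([^_\n\t ](.|\n(?!\n))*?)_(?=[^\w]|$)/';

try {
    echo replace_or_report($pattern, '[i]${1}[/i]', '_' . str_repeat('x', 1024)), "\n";
} catch (RuntimeException $e) {
    echo 'preg_replace failed: ', $e->getMessage(), "\n";
}
```

On a setup like the one described in the question, the catch branch would be expected to report PREG_JIT_STACKLIMIT_ERROR; on other builds the call may simply succeed.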

Regex Explanation

Just trying to do some markdown parsing to convert a string like _I am italic_, to a legacy version of markup we're using from our old site in certain situations. The reasons/uses aren't important. What's important is that I believe this should work just fine, and in fact it does, like, everywhere else except PHP 7.

It should match these underscores only if they delimit an independent word or sentence. It should not match the first underscore if it is preceded by any "word" character, and it should not match the last underscore if it is followed by any "word" character.

Environment: CentOS 7, PHP 7.1.6


Answer 1:


IMPORTANT NOTE:
The (.|\n)*? or (.|\r?\n)*? patterns should be avoided as they cause too much redundant backtracking. To match any char, you can usually use . with the DOTALL flag (the s modifier in PHP), or, in JavaScript, the [^] or [\s\S] constructs. See "How do I match any character across multiple lines in a regular expression?" for more details.
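As a small illustration of the note above, here is a sketch comparing the discouraged alternation with a plain dot under the s modifier. The sample text and variable names are made up for the example.

```php
<?php
// Sample subject spanning three lines (made up for illustration).
$text = "start\nmiddle\nend";

// Discouraged: alternation inside a quantified group.
preg_match('/start(.|\n)*?end/', $text, $slow);

// Preferred: a plain dot with the s (DOTALL) modifier, so . also matches newlines.
preg_match('/start.*?end/s', $text, $fast);

// Both patterns match the same text; only the amount of work differs.
var_dump($slow[0] === $fast[0]); // bool(true)
```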

Current Issue

The (.|\n(?!\n))*? pattern is very inefficient and causes a lot of redundant backtracking whenever it is not at the very end of the pattern (and at the very end a lazy quantifier makes no sense at all). The further to the left it sits in the pattern, the worse the performance.

Since all it does is lazily match any char other than a newline, or a newline that is not followed by another newline, you may rewrite that part of the pattern as .*?(?:\R(?!\R).*?)*:

'~\b_([^_\n\t ].*?(?:\R(?!\R).*?)*)_\b~'

See the regex demo.
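A short sketch of the rewritten pattern in use with the replacement string from the question. The sample inputs are made up for illustration, and the result for the long subject is deliberately left to be checked on your own build rather than asserted here.

```php
<?php
// Rewritten pattern from the answer, with the replacement from the question.
$pattern = '~\b_([^_\n\t ].*?(?:\R(?!\R).*?)*)_\b~';
$replacement = '[i]${1}[/i]';

// Sample inputs (made up for illustration).
$italic = preg_replace($pattern, $replacement, 'This is _I am italic_ text.');
$snake  = preg_replace($pattern, $replacement, 'some_variable_name');
$long   = preg_replace($pattern, $replacement, '_' . str_repeat('x', 1024));

echo $italic, "\n"; // This is [i]I am italic[/i] text.
echo $snake, "\n";  // some_variable_name (underscores inside a word are left alone)

// If the rewrite avoids the JIT stack limit, $long is the unchanged subject
// rather than null; preg_last_error() tells you which case you are in.
var_dump($long === null, preg_last_error() === PREG_NO_ERROR);
```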

Note:

  • (?<=[^\w]|^) = \b because there is a _ (a word char) after the lookbehind
  • (?=[^\w]|$) = \b because there is a _ before the lookahead
  • .*?(?:\R(?!\R).*?)* - matches:
    • .*? - any 0+ chars other than line break chars, as few as possible, then
    • (?:\R(?!\R).*?)* - zero or more sequences of:
      • \R(?!\R) - a line break sequence not followed by another line break sequence (\R = \n, \r\n or \r)
      • .*? - any 0+ chars other than line break chars, as few as possible
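The breakdown above can also be kept next to the pattern itself. The sketch below writes the same regex with the x (extended) modifier so each part carries a comment; using \x20 for the space inside the character class is a readability choice of this example, not something from the answer.

```php
<?php
// Same pattern as in the answer, laid out with the x modifier.
// Whitespace outside character classes is ignored and # starts a comment.
$pattern = '~
    \b_                 # opening underscore at a word boundary
    (                   # group 1: the italic text
      [^_\n\t\x20]      # first char: not an underscore, newline, tab or space
      .*?               # rest of the first line, as few chars as possible
      (?:               # then zero or more further lines:
        \R(?!\R)        #   one line break that is not followed by another
        .*?             #   and the text of that line, lazily
      )*
    )
    _\b                 # closing underscore at a word boundary
~x';

// A single line break inside the span is allowed...
$oneBreak = preg_replace($pattern, '[i]${1}[/i]', "_two\nlines_");
// ...but a blank line (two line breaks in a row) ends the candidate span.
$blankLine = preg_replace($pattern, '[i]${1}[/i]', "_two\n\nparagraphs_");

var_dump($oneBreak === "[i]two\nlines[/i]");      // bool(true)
var_dump($blankLine === "_two\n\nparagraphs_");   // bool(true)
```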


Source: https://stackoverflow.com/questions/44778923/php-7-preg-replace-preg-jit-stacklimit-error-with-simple-string
