PHP PREG_JIT_STACKLIMIT_ERROR - inefficient regex

寵の児 submitted on 2019-12-18 06:59:49

Question


I am getting a PREG_JIT_STACKLIMIT_ERROR error in preg_replace_callback() when working with a somewhat longer string. Above 2000 characters it stops working (that is, more than 2000 characters matched by the regex, not a 2000-character string).
I've already read that it's caused by an inefficient regex, but I can't make my regex simpler. Here's my regex:

/\{@([a-z0-9_]+)-((%?[a-z0-9_]+(:[a-z0-9_]+)*)+)\|(((?R)|.)*)@\}/Us

It should match strings like these:

1) {@if-statement|echo this|echo otherwise@}

2) {@if-statement:sub|echo this|echo otherwise@}

3) {@if-statement%statament2:sub|echo this@}

and also nested like this:

4) {@if-statement|echo this| {@if-statement2|echo this|echo otherwise@} @}

I've tried to simplify it to:

/\{@([a-z0-9_]+)-([a-z0-9_]+)\|(((?R)|.)*)@\}/Us

But it looks like the error is caused by the (((?R)|.)*) part. Any advice?

Code for testing:

$string = '{@if-is_not_logged_homepage|
<header id="header_home">
    <div class="in">
        <div class="top">
            <h1 class="logo"><a href="/"><img src="/img/logo-home.png" alt=""></a></h1>
            <div class="login_outer_wrapper">
                <button id="login"><div class="a"><i class="stripe"><i></i></i>Log in</div></button>
                <div id="login_wrapper">
                    <form method="post" action="{^login^}" id="form_login_global">
                        <div class="form_field no_description">
                            <label>{!auth:login_email!}</label>
                            <div class="input"><input type="text" name="form[login]"></div>
                        </div>
                        <div class="form_field no_description password">
                            <label>{!auth:password!}</label>
                            <div class="input"><input type="password" name="form[password]"></div>
                        </div>
                        <div class="remember">
                            <input type="checkbox" name="remember" id="remember_me_check" checked>
                            <label for="remember_me_check"><i class="fa fa-check" aria-hidden="true"></i>Remember</label>
                        </div>
                        <div class="submit_box">
                            <button class="btn btn_check">Log in</button>
                        </div>
                    </form>
                </div>
            </div>
        </div>
        <div class="content clr">
            <div class="main_menu">
                <a href="">
                    <i class="ico a"><i class="fa fa-lightbulb-o" aria-hidden="true"></i></i>
                    <span>Idea</span>
                    <div>&nbsp;</div>
                </a>
                <a href="">
                    <i class="ico b"><i class="fa fa-user" aria-hidden="true"></i></i>
                    <span>FFa</span>
                </a>
                <a href="">
                    <i class="ico c"><i class="fa fa-briefcase" aria-hidden="true"></i></i>
                    <span>Buss</span>
                </a>
            </div>
            <div class="text_wrapper">

                <div>
                    <div class="register_wrapper">
                        <a id="main_register" class="btn register">Załóż konto</a>
                        <form method="post" action="{^login^}" id="form_register_home">
                            <div class="form_field no_description">
                                <label>{!auth:email!}</label>
                                <div class="input"><input type="text" name="form2[email]"></div>
                            </div>
                            <div class="form_field no_description password">
                                <label>{!auth:password!}</label>
                                <div class="input tooltip"><input type="password" name="form2[password]"><i class="fa fa-info-circle tooltip_open" aria-hidden="true" title="{!auth:password_format!}"></i></div>

                            </div>
                            <div class="form_field terms no_description">
                                <div class="input">
                                    <input type="checkbox" name="form2[terms]" id="terms_check">
                                    <label for="terms_check"><i class="fa fa-check" aria-hidden="true"></i>Agree</label>
                                </div>
                            </div>
                            <div class="form_field no_description">
                                <div class="input captcha_wrapper">
                                    <div class="g-recaptcha" data-sitekey="{%captcha_public_key%}"></div>
                                </div>
                            </div>
                            <div class="submit_box">
                                <button class="btn btn_check">{!auth:register_btn!}</button>
                            </div>
                        </form>
                    </div>
                </div>
            </div>
        </div>
    </div>
</header>
@}';

$if_counter = 0;

$parsed_view = preg_replace_callback( '/\{@([a-z0-9_]+)-((%?[a-z0-9_]+(:[a-z0-9_]+)*)+)\|(((?R)|.)*)@\}/Us',
        function( $match ) use( &$if_counter ){
            return '<-{'. ( $if_counter ++ ) .'}->';
        }, $string );


var_dump($parsed_view); // NULL

Answer 1:


What is PCRE JIT?

Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching. However, it comes at the cost of extra processing before the match is performed. Therefore, it is of most benefit when the same pattern is going to be matched many times.

and how does it work basically?

PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where the local data of the current node is pushed before checking its child nodes... When the compiled JIT code runs, it needs a block of memory to use as a stack. By default, it uses 32K on the machine stack. However, some large or complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.

From the first quote you can see that JIT is an optional feature, and it is on by default in PHP 7.* PCRE. So you can easily turn it off with pcre.jit = 0 (although that is not recommended).

However, when a preg_* function reports error code #6 (PREG_JIT_STACKLIMIT_ERROR), it most likely means that JIT hit the stack size limit.
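A minimal sketch of how the error code can be checked after a failing call. The helper name preg_error_name() is hypothetical (it is not part of PHP); only preg_last_error() and the PREG_* constants are built in.

```php
<?php
// Hypothetical helper: translate a preg_last_error() code into its constant name.
function preg_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR', // code 6
    ];
    return $names[$code] ?? 'unknown (' . $code . ')';
}

// Usage: preg_replace* returns NULL on failure, so check the result first.
$out = preg_replace('/a/', 'b', 'aaa');
if ($out === null) {
    echo 'preg error: ', preg_error_name(preg_last_error()), "\n";
}
```

Checking for NULL before using the result avoids silently passing an empty view further down, which is what happens with the var_dump() in the question.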

Capturing groups consume more memory than non-capturing groups (and even more memory is needed depending on the type of quantifier applied to the group):

  1. Capturing group OP_CBRA (pcre_jit_compile.c:#1138) - (real memory is more than this):

     case OP_CBRA:
     case OP_SCBRA:
     bracketlen = 1 + LINK_SIZE + IMM2_SIZE;
     break;

  2. Non-capturing group OP_BRA (pcre_jit_compile.c:#1134) - (real memory is more than this):

     case OP_BRA:
     bracketlen = 1 + LINK_SIZE;
     break;

Therefore, changing capturing groups to non-capturing groups in your own RegEx makes it give the proper output (I don't know exactly how much memory that saves).
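As a sketch, this is the pattern from the question with the inner helper groups made non-capturing, keeping only the three groups that carry data. Whether this is enough to stay under the JIT stack limit depends on the input and on the PCRE build, so it is only run here against a short nested string.

```php
<?php
// Same structure as the original pattern; (?: ... ) replaces the groups
// that were only used for grouping, not for capturing.
$pattern = '/\{@([a-z0-9_]+)-((?:%?[a-z0-9_]+(?::[a-z0-9_]+)*)+)\|((?:(?R)|.)*)@\}/Us';

$string = '{@if-statement|echo this| {@if-statement2|echo this|echo otherwise@} @}';
$if_counter = 0;

$parsed_view = preg_replace_callback($pattern, function ($match) use (&$if_counter) {
    // $match[1], $match[2] and $match[3] are still available here.
    return '<-{' . ($if_counter++) . '}->';
}, $string);

var_dump($parsed_view);
```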

But it seems you need the capturing groups. In that case you should rewrite your RegEx for the sake of performance. Backtracking is the main thing to keep under control in a RegEx.

Update #1

Solution:

(?(DEFINE)
  (?<recurs>
    (?! {@|@} ) [^|] [^{@|\\]* ( \\.[^{@|\\]* )* | (?R)
  )
)
{@
(?<If> \w+)-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^{@|]*+ (?&recurs)* )
(?<False> [|] (?&recurs)* )?
\s*@}


PHP code (watch backslash escaping):

preg_match_all('/(?(DEFINE)
  (?<recurs>
    (?! {@|@} ) [^|] [^{@|\\\\]* ( \\\\.[^{@|\\\\]* )* | (?R)
  )
)
{@
(?<If> \w+ )-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^{@|]*+ (?&recurs)* )
(?<False> [|] (?&recurs)* )?
\s*@}/x', $string, $matches);

This is your own RegEx, optimized to take as few backtracking steps as possible. So whatever your original was supposed to match is matched by this one too.

RegEx that does not follow nested if blocks:

{@
(?<If> \w+)-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^|\\]* (?: \\.[^|\\]* )* )
(?<False> [|] \X*)?
@}


Most of the quantifiers are written possessively (which avoids backtracking) by appending + to them.
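A small illustration of what a possessive quantifier changes, using a toy pattern rather than the template regex. A greedy a+ gives one character back so that the trailing a can match, while a possessive a++ keeps everything it consumed and the match fails without any backtracking.

```php
<?php
// Greedy: a+ first takes "aaa", then backtracks one step so the final "a" can match.
$greedy = preg_match('/a+a/', 'aaa');

// Possessive: a++ takes "aaa" and never gives anything back, so the final "a" cannot match.
$possessive = preg_match('/a++a/', 'aaa');

var_dump($greedy, $possessive);
```

This is why possessive quantifiers are only safe where the following token can never match what the quantified token matches, as with \w++ followed by a space or : in the pattern above.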




Answer 2:


The problem as you can see is that your pattern is inefficient. The main reasons are:

  • You use subpatterns of the form (a+)+b, which is the best way to get catastrophic backtracking
  • You also use subpatterns of the form (a|b)+, which may be a good design except with a backtracking regex engine like PCRE
  • You use the U modifier for no apparent reason; it makes all your quantifiers non-greedy and generates a lot of useless tests

As an aside, there are too many useless capture groups that consume memory for nothing. When you don't need a capture group, don't write it. If you really need to group elements, use a non-capturing group, but don't use non-capturing groups just to make a pattern "more readable" (there are other ways to do that, like named groups, free-spacing mode and comments).


If I understand correctly, you are trying to build a regex for preg_replace_callback to handle the control statements of your template system. Since these control statements can be nested and a regex engine can't match the same substring several times, you have to choose between several strategies:

  1. You can write a recursive pattern to describe a conditional statement that may contain other conditional statements.

  2. You can write a pattern that matches only the innermost conditional statements. (In other words it forbids nested conditional statements.)

In both cases, you need to parse the string several times until there's nothing left to replace. (Note that you can also use a recursive function with the first strategy, but it makes things more complicated.)

Let's see the second way:

$pattern = '~
{@ (?<cond> \w+ ) - (?<stat> \w+ (?: % \w+ )* ) (?: : (?<sub> \w+ ) )? \|

# a "THEN" part that doesn\'t have nested conditional statements
(?<then> [^{|@]*+ (?: { (?!@) [^{|@]* | @ (?!}) [^{|@]* )*+ )

# optional "ELSE" part (the content is similar to the "THEN" part)
(?: \| (?<else> \g<then> ) )? (*SKIP) @}~x';

$parsed_view = $string;
$count = 0;

do {
    $parsed_view = preg_replace_callback($pattern, function ($m) {
        // Do what you need here. The different captures can be
        // easily accessed by name: $m['cond'], $m['stat']...
        // as defined in the pattern.
        $result = ''; // placeholder: build the replacement text from the captures
        return $result;
    }, $parsed_view, -1, $count);
} while ($count);


As you can see, the problem of nested statements is solved with the do..while loop and the count parameter of preg_replace_callback, which tells you whether something was replaced.

This code isn't tested, but I'm sure you can complete it and adapt it to your needs.
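A small self-contained run of the innermost-first loop, as a sketch under stated assumptions: the $conditions table and the rule "return the THEN part when the condition is true, otherwise the ELSE part" are made up for illustration and are not part of the question's template system. The pattern is the one given above with its comments removed.

```php
<?php
// Innermost-only pattern from the answer (free-spacing mode, comments removed).
$pattern = '~
{@ (?<cond> \w+ ) - (?<stat> \w+ (?: % \w+ )* ) (?: : (?<sub> \w+ ) )? \|
(?<then> [^{|@]*+ (?: { (?!@) [^{|@]* | @ (?!}) [^{|@]* )*+ )
(?: \| (?<else> \g<then> ) )? (*SKIP) @}~x';

// Hypothetical condition table, for illustration only.
$conditions = ['a' => false, 'b' => true];

$parsed_view = '{@if-a|A| {@if-b|B|C@} @}';
$count = 0;
$passes = 0;

do {
    $parsed_view = preg_replace_callback($pattern, function ($m) use ($conditions) {
        // The "else" group is absent from $m when the statement has no ELSE part.
        $else = $m['else'] ?? '';
        return !empty($conditions[$m['stat']]) ? $m['then'] : $else;
    }, $parsed_view, -1, $count);
    $passes++;
} while ($count);

var_dump($parsed_view, $passes);
```

The first pass replaces only the inner statement, the second pass replaces the outer one (which no longer contains a nested statement), and the third pass finds nothing and ends the loop.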


As an aside, there are a lot of template engines that already exist (and PHP is already a template engine). You can use them and avoid creating your own syntax. You can also take a look at their code.



Source: https://stackoverflow.com/questions/39685883/php-preg-jit-stacklimit-error-inefficient-regex
