PHP regex: is there anything wrong with this code?

长发绾君心 2020-12-20 07:16

preg_replace_callback('#<(code|pre)([^>]*)>(((?!</?\1).)*|(?R))*</\1>#si', 'self::replaceit', $text);

I'm trying to r

2 Answers
  • 2020-12-20 07:31

    I'd like to help. I've seen this problem before!

    Your regex looks logically fine, but when applied to a large-ish subject string it is likely causing a lot of recursive backtracking, which overflows the stack in the PCRE engine. The overflow results in a segmentation fault that crashes the process running PCRE (either Apache or PHP) without warning. (The symptom is the "connection closed by remote server" message.) This unhandled crash is due to PHP's poor choice of default for the pcre.recursion_limit parameter (it defaults to 100,000, which is too high). First, let's see whether this is, in fact, part of the problem.

    Add the following code to your script:

    // Place this at the top of the script
    ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
    
    $re = '#<(code|pre)([^>]*)>(((?!</?\1).)*|(?R))*</\1>#si';
    $text = preg_replace_callback($re, 'self::replaceit', $text);
    // Check the return value for NULL which indicates a PCRE error.
    if ($text === null) exit("PCRE Error! Subject too large or complex.");
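
    If you want the exit message to say which limit was hit, the NULL check can be paired with preg_last_error(). A minimal sketch, assuming PHP 7 or later; the helper name pcreErrorName() is made up for illustration and is not part of PHP:

    ```php
    <?php
    // Hypothetical helper: map a preg_last_error() code to a readable label.
    function pcreErrorName(int $code): string
    {
        $names = [
            PREG_NO_ERROR              => 'no error',
            PREG_INTERNAL_ERROR        => 'internal PCRE error',
            PREG_BACKTRACK_LIMIT_ERROR => 'pcre.backtrack_limit exhausted',
            PREG_RECURSION_LIMIT_ERROR => 'pcre.recursion_limit exhausted',
            PREG_BAD_UTF8_ERROR        => 'malformed UTF-8 in subject',
            PREG_BAD_UTF8_OFFSET_ERROR => 'offset is not on a UTF-8 boundary',
            PREG_JIT_STACKLIMIT_ERROR  => 'JIT stack limit exhausted',
        ];
        // Fall back to a generic label for codes this table does not know.
        return $names[$code] ?? 'unknown PCRE error';
    }

    // Usage after any preg_* call that returned NULL:
    //   exit('PCRE Error: ' . pcreErrorName(preg_last_error()));
    ```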
    

    With this in place you should no longer get the "connection closed" message but rather the PCRE error exit message. Note that the setting of 524 above is for a Win32 Apache httpd.exe (which has a 256KB stack). If you are running on a *nix server, you can raise this value to 16777. The reasoning behind these numbers is that pcre.recursion_limit should be set to the executable's stack size divided by 500. The Win32 executable typically has a 256KB stack, and *nix executables are typically built with an 8MB stack. Philip Hazel (author of the excellent PCRE engine) has addressed this problem in detail. See the pcrestack man page.
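
    The arithmetic behind 524 and 16777 can be sketched as below. The function name recursionLimitForStack() is made up for illustration, and the two stack sizes are the typical values quoted above, not something PHP reports; check your own build before relying on them.

    ```php
    <?php
    // Rule of thumb from this answer: limit = stack size in bytes / 500.
    function recursionLimitForStack(int $stackBytes): int
    {
        return intdiv($stackBytes, 500);
    }

    // Assumed typical stack sizes: 256KB on Win32, 8MB on *nix.
    $stackBytes = PHP_OS_FAMILY === 'Windows' ? 256 * 1024 : 8 * 1024 * 1024;

    // ini_set() expects the value as a string.
    ini_set('pcre.recursion_limit', (string) recursionLimitForStack($stackBytes));
    ```

    PHP_OS_FAMILY needs PHP 7.2 or later; on older versions, hard-code the value for your platform as in the snippet above.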

    Once you have done this, report back and I'll help with the next phase...

    (Note that it is NOT the (?R) expression causing the problem. More later.)

    The regex can be significantly improved (with regard to both solving this issue and improving its speed), by implementing Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique. This will dramatically reduce the number of necessary backtracks and likely solve your problem. Here is an improved (and thoroughly commented) version of your regex.

    $re = '% # Match an outermost PRE or CODE element.
        (               # $1: PRE/CODE element open tag
          <(code|pre)   # $2: Open tag name
          [^>]*+>       # Remainder of opening tag.
        )               # End $1: PRE/CODE element open tag.
        (               # $3: PRE/CODE element contents.
          (?:           # Group for contents alternatives
            (?R)        # Either a nested PRE or CODE element
          |             # Or non- <CODE, </CODE, <PRE or </PRE stuff.
            [^<]*+      # Begin: {normal* (special normal*)*} construct
            (?:         # See: "Mastering Regular Expressions".
              <         # {special} Match a <, but only if it is
              (?!/?\2)  # not the start of a nested or closing tag.
              [^<]*+    # match more {normal*}
            )*+         # Finish "Unrolling the loop"
          )*+           # Zero or more contents alternatives.
        )               # End $3: PRE/CODE element contents.
        (</\2>)         # $4: PRE/CODE element close tag
        %ix';
    

    However, this regex differs in that it uses four capture groups: $1 contains the whole element start tag, $2 contains the element tag name (which is used as a back reference), $3 contains the element contents, and $4 contains the element end tag.
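
    Because the group numbers changed, the callback has to read $m[1], $m[3] and $m[4] instead of the old groups. Here is a rough sketch of how that might look. The regex is a condensed copy of the commented one above; the callback (which simply escapes the element contents) and the sample input are made up for illustration, since I don't know what your replaceit method does.

    ```php
    <?php
    // Condensed copy of the commented regex above (same four groups).
    $re = '%
        (<(code|pre)[^>]*+>)                      # $1: open tag, $2: tag name
        ((?:(?R)|[^<]*+(?:<(?!/?\2)[^<]*+)*+)*+)  # $3: element contents
        (</\2>)                                   # $4: close tag
        %ix';

    // Hypothetical callback: keep both tags, escape everything between them.
    $escapeContents = function (array $m): string {
        return $m[1] . htmlspecialchars($m[3]) . $m[4];
    };

    // Made-up sample input with a nested PRE element.
    $text = 'before <pre class="x">a <b>bold</b> <pre>inner</pre> z</pre> after';

    $result = preg_replace_callback($re, $escapeContents, $text);
    if ($result === null) {
        exit('PCRE Error! Subject too large or complex.');
    }
    echo $result, "\n";
    ```

    Only the outermost element reaches the callback; the nested PRE arrives as part of $m[3].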

  • 2020-12-20 07:50

    is there anything wrong with this code?

    Yes. You're trying to parse HTML with a regex. Tsk, tsk, tsk. Let's not summon Zalgo quite yet.

    You should be using the DOM.

    $doc = new DOMDocument();
    $doc->loadHTML($text);
    $code_tags = $doc->getElementsByTagName('code');
    $pre_tags = $doc->getElementsByTagName('pre');
    

    This will leave you with a set of node lists (DOMNodeList objects), whose contents you may process as you desire. If you encounter &lt; and friends in the textContent (or when re-serializing the contents with saveXML) and you need the actual tags, consider htmlspecialchars_decode.


    Getting the first and last element in $code_tags, which is a DOM Node List:

    $first_code_tag = $code_tags->item(0);
    $last_code_tag = $code_tags->item( $code_tags->length - 1 );
    

    While you can treat a node list like an array inside a foreach, it isn't directly indexable, hence the check of the length property and the use of the item method. Be aware that when there's only one item in the list, the first and last node will be identical. Thankfully, you can just check whether $code_tags->length is greater than one before looking at the last node in addition to the first.
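
    Put together, that might look like the sketch below. The sample HTML and the variable names are made up for illustration.

    ```php
    <?php
    // Made-up sample input with two CODE elements.
    $html = '<div><code>one</code><pre>x</pre><code>two</code></div>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $codeTags = $doc->getElementsByTagName('code');

    // A DOMNodeList can be walked with foreach...
    $contents = [];
    foreach ($codeTags as $node) {
        $contents[] = $node->textContent;
    }

    // ...but single nodes are fetched with item(), guarded by length.
    $first = $codeTags->length > 0 ? $codeTags->item(0) : null;
    $last  = $codeTags->length > 1 ? $codeTags->item($codeTags->length - 1) : null;
    ```

    With this guard, $last stays null when the list holds a single node, so the same element is never processed twice.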

    I'm not sure this is going to help you. Based on your other questions, it sounds like you're using this methodology to work on BBCode, and that you've turned the square brackets into less-than and greater-than signs. This isn't a problem, mind you, but it might make life interesting.

    Try inspecting the output of:

    echo $doc->saveXML($first_code_tag);
    

    to see if it's giving you the content that you expect.
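
    For comparison, here is a small sketch of what textContent and saveXML each give you for the same node. The input string is made up for illustration.

    ```php
    <?php
    $doc = new DOMDocument();
    // Made-up input: a CODE element whose text contains an escaped "<".
    $doc->loadHTML('<div><code>a &lt; b</code></div>');

    $node = $doc->getElementsByTagName('code')->item(0);

    // textContent is the decoded text of the node, without any markup.
    $plain = $node->textContent;

    // saveXML($node) serializes the node itself, tags and entities included.
    $markup = $doc->saveXML($node);

    echo $plain, "\n", $markup, "\n";
    ```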
