Why special characters like = or " break PHP regexp when using \b word boundary?

臣服心动 2021-02-20 02:15

This is a follow-up after reading How to specify "Space or end of string" and "space or start of string"?

From there, it states means to match a word

3 Answers
  • 2021-02-20 02:39

    The problem is your use of \b, which is a "word boundary." It is a zero-width assertion that holds at the positions described by (^\w|\w$|\W\w|\w\W), where \w is a "word" character [A-Za-z0-9_] and \W is the opposite. A " is not a "word" character, so when it is followed by a space or the end of the string, neither side of that position is a word character and the boundary condition is not met.
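
    A quick check of that claim, as a minimal sketch (the sample strings are made up for illustration): the trailing \b only succeeds when a word character directly follows the closing quote.

    ```php
    <?php
    // Pattern with \b on both sides of a token that ends in a quote.
    $pattern = '/\bstackoverflow=""\b/';

    // Quote followed by a space: both neighbours are non-word characters,
    // so there is no word boundary after the quote and the match fails.
    var_dump(preg_match($pattern, 'xxx stackoverflow="" xxx')); // int(0)

    // Quote followed by a letter: a boundary exists between " and x,
    // so the pattern matches here, the opposite of what is wanted.
    var_dump(preg_match($pattern, 'xxx stackoverflow=""xxx'));  // int(1)
    ```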

    Try using a \s instead, which will match any whitespace character.

    (?:^|\s)stackoverflow=""(?:\s|$)
    

    Most metacharacters lose their special meaning inside a character class. The exceptions are ^ used as a negation operator at the very beginning of the class, - as a range operator, and \ and ] themselves. This is why [ ^] wouldn't work for you: it was searching for a space or a literal ^.
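
    To see the difference, here is a small sketch (the token foo is just a placeholder) comparing the character class with the alternation:

    ```php
    <?php
    // Inside a class, a ^ that is not the first character is a literal caret.
    $class = '/[ ^]foo/';
    var_dump(preg_match($class, ' foo')); // int(1): the space matches
    var_dump(preg_match($class, '^foo')); // int(1): the literal ^ matches
    var_dump(preg_match($class, 'foo'));  // int(0): start of string is not matched

    // In an alternation, ^ keeps its meaning as the start-of-string anchor.
    $alt = '/(?:^|\s)foo/';
    var_dump(preg_match($alt, 'foo'));    // int(1)
    ```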

    $ php -a
    Interactive shell
    
    php > $input_line='
    php ' stackoverflow="" xxx
    php ' xxx stackoverflow="" xxx
    php ' xxx stackoverflow=""
    php ' ';
    php > echo preg_replace('/(?:^|\s)stackoverflow=""(?:\s|$)/', 'OK', $input_line);
    OKxxx
    xxxOKxxx
    xxxOK
    

    https://regex101.com/r/nP2aB8/1

  • 2021-02-20 02:40

    Background

    From the regular-expressions.info Word boundaries page:

    The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

    There are three different positions that qualify as word boundaries:
    - Before the first character in the string, if the first character is a word character.
    - After the last character in the string, if the last character is a word character.
    - Between two characters in the string, where one is a word character and the other is not a word character.
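
    The three positions can be observed directly. A minimal sketch, with an arbitrary sample string, that lists every offset where \b matches:

    ```php
    <?php
    // Collect every position where the zero-length \b assertion succeeds.
    $subject = 'ab, cd';
    preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

    // Each match is an empty string paired with its byte offset.
    $offsets = array_column($matches[0], 1);

    // 0: before the first character, 2 and 4: between a word and a
    // non-word character, 6: after the last character.
    print_r($offsets);
    ```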

    A very good explanation from nhahtdh's post:

    A word boundary \b is equivalent to:

    (?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
    

    Which means:

    • Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).

      OR

    • Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
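
    The equivalence can be spot-checked by comparing match offsets. This is only a sketch: boundaryOffsets is a hypothetical helper and the sample strings are arbitrary.

    ```php
    <?php
    // Hypothetical helper: byte offsets at which a zero-width pattern matches.
    function boundaryOffsets(string $pattern, string $subject): array
    {
        preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
        return array_column($m[0], 1);
    }

    $wordBoundary = '/\b/';
    $lookarounds  = '/(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/';

    // Both patterns should report the same positions for every sample.
    foreach (['stackoverflow="" xxx', '"quoted" text', '', '!!'] as $s) {
        var_dump(boundaryOffsets($wordBoundary, $s) === boundaryOffsets($lookarounds, $s));
    }
    ```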

    What's wrong with your regex

    The reason \b is not suitable is that whether it matches depends on the characters on both sides of it: it needs a word character on one side and a non-word character (or the start/end of the string) on the other. When you build a regex dynamically, you do not know in advance which one to use, \B or \b. For your case, you could use '/\bstackoverflow=""\B/', but that would require logic to pick the right word/non-word boundary for each end of the keyword. However, there is an easier way: use negative lookarounds.
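
    For illustration only, the "smart" boundary selection could look roughly like this. The helper name withSmartBoundaries is hypothetical and not part of the original answer; it assumes a non-empty keyword.

    ```php
    <?php
    // Hypothetical helper: use \b next to a word character and \B next to a
    // non-word character at each end of the keyword.
    function withSmartBoundaries(string $keyword): string
    {
        $start = preg_match('/\A\w/', $keyword) ? '\b' : '\B';
        $end   = preg_match('/\w\z/', $keyword) ? '\b' : '\B';
        return '/' . $start . preg_quote($keyword, '/') . $end . '/';
    }

    $re = withSmartBoundaries('stackoverflow=""');
    var_dump(preg_match($re, 'xxx stackoverflow="" xxx')); // int(1)
    var_dump(preg_match($re, 'xxx stackoverflow=""xxx'));  // int(0)
    var_dump(preg_match($re, 'xstackoverflow="" xxx'));    // int(0)
    ```

    This works, but it is extra code to maintain compared with a pair of fixed lookarounds.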

    Solution

    (?<!\w)stackoverflow=""(?!\w)
    

    See regex demo

    The regex contains negative lookarounds instead of word boundaries. The (?<!\w) lookbehind fails the match if there is a word character before stackoverflow="", and (?!\w) lookahead fails the match if stackoverflow="" is followed by a word character.

    What the shorthand character class \w matches depends on whether you enable the Unicode modifier /u. Without it, \w matches just [a-zA-Z0-9_] (in the default "C" locale). You can add further restrictions using the lookarounds.
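
    A small sketch of how /u changes the result of the lookarounds. The keyword caf and the sample word are made up, and the locale is pinned to "C" so that byte-mode \w stays ASCII-only:

    ```php
    <?php
    setlocale(LC_ALL, 'C');           // keep byte-mode \w limited to ASCII

    $subject = "caf\xC3\xA9";         // "café" encoded as UTF-8 bytes
    $bytes   = '/(?<!\w)caf(?!\w)/';  // byte mode
    $unicode = '/(?<!\w)caf(?!\w)/u'; // UTF-8 mode, Unicode-aware \w

    // Byte mode: the bytes of é are not word characters, so "caf" looks
    // like a standalone token.
    var_dump(preg_match($bytes, $subject));   // int(1)

    // With /u, é is a word character, so the lookahead rejects the match.
    var_dump(preg_match($unicode, $subject)); // int(0)
    ```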

    Demo

    PHP demo:

    $re = '/(?<!\w)stackoverflow=""(?!\w)/'; 
    $str = ",stackoverflow=\"\" xxx\nxxx stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"\nstackoverflow=\"\" xxx"; 
    echo preg_replace($re, "NEW=\"\"", $str);
    

    NOTE: If you pass your string as a variable, remember to escape all special characters in it with preg_quote:

    $re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/'; 
    

    Here, notice the second argument to preg_quote, which is /, the regex delimiter char.
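
    Putting the pieces together, a minimal sketch of a reusable helper. The name replaceToken and the sample keyword are assumptions for illustration; the keyword deliberately contains the delimiter character.

    ```php
    <?php
    // Hypothetical helper: replace $keyword only where it is not glued to
    // word characters on either side.
    function replaceToken(string $keyword, string $replacement, string $subject): string
    {
        $re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';
        // A callback returns the replacement verbatim, so sequences such as
        // $1 or \1 inside it are not treated as backreferences.
        return preg_replace_callback($re, function () use ($replacement) {
            return $replacement;
        }, $subject);
    }

    // The first occurrence is replaced; the second is followed by "z",
    // a word character, so it is left alone.
    echo replaceToken('a/b=""', 'OK', 'x a/b="" y a/b=""z'), "\n"; // x OK y a/b=""z
    ```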

  • 2021-02-20 02:48

    " is, of course, not special.

    The word boundary, \b, OTOH, is. It looks for a word beginning/ending, and on the boundary it expects a word character - and the quote is not such a character.

    Remove it from the end or replace it with a negative look-ahead search for a word character.
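
    Both options can be sketched as follows (the sample strings are made up); they differ only when a word character directly follows the closing quote:

    ```php
    <?php
    $noTrailing = '/\bstackoverflow=""/';       // option 1: trailing \b removed
    $lookahead  = '/\bstackoverflow=""(?!\w)/'; // option 2: negative look-ahead

    // Followed by a space: both options match.
    var_dump(preg_match($noTrailing, 'xxx stackoverflow="" xxx')); // int(1)
    var_dump(preg_match($lookahead, 'xxx stackoverflow="" xxx'));  // int(1)

    // Followed by a word character: only option 1 still matches.
    var_dump(preg_match($noTrailing, 'stackoverflow=""xxx'));      // int(1)
    var_dump(preg_match($lookahead, 'stackoverflow=""xxx'));       // int(0)
    ```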
