How exactly do Regular Expression word boundaries work in PHP?

伪装坚强ぢ 2020-11-27 07:55

I'm currently writing a library for matching specific words in content.

Essentially the way it works is by compiling words into regular expressions, and running con

3 Answers
  • 2020-11-27 08:19

    @ is not a word character. By default, a "word" character is any letter, digit, or the underscore (Source), although your locale may change this. So @ matches \W rather than \w, and, as linked, any \w\W or \W\w combination marks a \b position. In the OP's regex it is therefore always the word boundary that decides whether there is a match.

    The following is similar to your regexes, except that a is used instead of @. The beginning of a line is also a word boundary when the first character is a word character, so there is no need to specify it separately:

    $r = preg_match("/\b(animal)/i", "somethinganimal", $match);
    var_dump($r, $match);
    
    $r = preg_match("/\b(animal)/i", "something!animal", $match);
    var_dump($r, $match);
    

    Output:

    int(0)
    array(0) {
    }
    int(1)
    array(2) {
      [0]=>
      string(6) "animal"
      [1]=>
      string(6) "animal"
    }
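
    To make the "\w\W or \W\w combination" rule concrete, here is a minimal sketch that lists every offset where \b matches. The helper name boundaryOffsets is made up for illustration, and the sample string is an assumption, not the OP's data.

```php
<?php
// Hypothetical helper: return every byte offset at which \b matches.
function boundaryOffsets(string $subject): array
{
    // \b is zero-width, so each match is an empty string plus its offset.
    preg_match_all('/\b/', $subject, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);
}

// "hi @you": h=0, i=1, space=2, @=3, y=4, o=5, u=6
$offsets = boundaryOffsets('hi @you');
print_r($offsets); // offsets 0, 2, 4 and 7
```

    There is no boundary at offset 3, between the space and the @, because both are \W characters. The boundary at offset 4 exists only because the @ is followed by a word character.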
    
  • 2020-11-27 08:24

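    One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe creates a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should treat the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear) as part of the word, for example by replacing \b with lookarounds over a custom class, e.g. (?<![\w'’‘]) before the word and (?![\w'’‘]) after it. Note that \b cannot simply be put inside a character class: there it means a backspace character, so a class such as [\b^'] will not work.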
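
    A rough sketch of that idea, assuming lookarounds are acceptable in your library. The function name wholeWordPattern is hypothetical, and the set of apostrophe variants is only a starting point:

```php
<?php
// Hypothetical helper: build a whole-word pattern that treats apostrophes
// (', U+2019, U+2018) as part of a word instead of as a boundary.
function wholeWordPattern(string $word): string
{
    $inner = "\\w'\u{2019}\u{2018}"; // characters that may not touch the word
    return '/(?<![' . $inner . '])' . preg_quote($word, '/') . '(?![' . $inner . '])/iu';
}

// Plain \b sees a boundary between "n" and the apostrophe.
$plain  = preg_match('/\bcan\b/i', "I can't go");             // 1
// The lookaround version refuses to match inside "can't" ...
$inside = preg_match(wholeWordPattern('can'), "I can't go");  // 0
// ... but still matches the stand-alone word.
$alone  = preg_match(wholeWordPattern('can'), "Yes, I can."); // 1

var_dump($plain, $inside, $alone);
```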

    You might also experience problems with UTF-8 characters that are genuinely part of the word (i.e. what we humans mean by a word). For example, test your regex against a word such as Svašek, encoded the same way as your content.
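
    A small test along those lines, as a sketch. It assumes the content is UTF-8 and relies on PHP's u modifier, which, as far as I know, also makes \w and \b Unicode-aware; the byte-mode result can vary with locale, so no output is asserted for it here.

```php
<?php
// "Svašek", built with an escape so the encoding of this file does not matter.
$subject = "Sva\u{0161}ek";

// Byte mode: "š" is two bytes, which are usually not word characters,
// so a false boundary can appear in the middle of the word.
$byteMode = preg_match('/\bek\b/', $subject);

// UTF-8 mode: "š" counts as a word character, so "ek" is not a separate word.
$utfPart  = preg_match('/\bek\b/u', $subject);              // 0
$utfWhole = preg_match('/\bSva\x{0161}ek\b/u', $subject);   // 1

var_dump($byteMode, $utfPart, $utfWhole);
```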

    It is therefore often easier, when parsing normal "linguistic" text, to look for "linguistic" boundaries such as whitespace (not just literal spaces, but the full class including newlines and tabs), commas, colons, full stops, etc. (and angle brackets if you are parsing HTML). YMMV.
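
    One way to sketch the "linguistic boundary" approach is to split on whitespace and punctuation first and compare whole tokens afterwards. The delimiter class and the sample text below are assumptions; adjust them to your content.

```php
<?php
$text = "Don't panic: the animal, it's fine.\nReally";

// Split on runs of whitespace, common punctuation and angle brackets.
// Apostrophes are not delimiters, so "Don't" and "it's" stay intact.
$tokens = preg_split('/[\s,.:;!?<>]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens); // Don't, panic, the, animal, it's, fine, Really

// Whole-word lookup is then a plain comparison.
$found = in_array('animal', $tokens, true);
var_dump($found); // bool(true)
```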

  • 2020-11-27 08:31

    The word boundary \b matches at a change from a \w (a word character) to a \W (a non-word character), or vice versa. You want to match when there is a \b before your @, which is a \W character. So, for a match, you need a word character directly before your @.

    something@nimal
            ^^
    

    ==> Match because of the word boundary between g and @.

    something!@nimal
             ^^ 
    

    ==> NO match because there is no word boundary between ! and @: both characters are \W.
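
    The two cases above can be checked directly. This is a minimal sketch; the pattern /\b@nimal/i is an assumed stand-in for the OP's regex, which is not shown in full here.

```php
<?php
$pattern = '/\b@nimal/i'; // assumed stand-in for the OP's pattern

$results = [];
foreach (['something@nimal', 'something!@nimal', '@nimal'] as $subject) {
    $results[$subject] = preg_match($pattern, $subject);
}

// something@nimal  => 1 (boundary between g and @)
// something!@nimal => 0 (! and @ are both \W)
// @nimal           => 0 (start of string followed by \W is not a boundary)
var_dump($results);
```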
