Regex difference between word boundary end and edge

名媛妹妹 2021-01-14 09:31

The R help file for regex says:

The symbols \\< and \\> respectively match the empty string at the beginning and end of a word. The symbol \\b mat

1 Answer
  • 2021-01-14 10:09

    The difference between \b and \< / \> is one of engine support: \b also works in PCRE regex patterns (when you specify perl=TRUE) and in ICU regex patterns (the stringr package), whereas \< and \> are only recognized by R's default regex engine.

    > s <- "no where nowhere"
    > sub("\\<no\\>", "", s)
    [1] " where nowhere"
    > sub("\\<no\\>", "", s, perl=TRUE) ## \< and \> do not work with PCRE
    [1] "no where nowhere"
    > sub("\\bno\\b", "", s, perl=TRUE) ## \b works with PCRE
    [1] " where nowhere"
    
    > library(stringr)
    > str_replace(s, "\\bno\\b", "")
    [1] " where nowhere"
    > str_replace(s, "\\<no\\>", "")
    [1] "no where nowhere"
    

    The advantage of \< (which always matches the beginning of a word) and \> (which always matches the end of a word) is that they are unambiguous, whereas \b can match at either position.
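
    To make the directional behaviour concrete, here is a minimal sketch in base R. The test strings and variable names are made up for illustration. It uses sub() so that only the first match is replaced, and it assumes the default engine for \< / \> and perl = TRUE for \b.

    ```r
    # Default engine: \< only anchors at a word start, \> only at a word end.
    start_only <- sub("\\<no", "X", "piano no")    # the "no" inside "piano" is skipped
    end_only   <- sub("no\\>", "X", "nowhere no")  # the "no" opening "nowhere" is skipped

    # PCRE: \b is the same token in both roles; its position in the
    # pattern decides which edge of the word it ends up testing.
    b_start <- sub("\\bno", "X", "piano no", perl = TRUE)
    b_end   <- sub("no\\b", "X", "nowhere no", perl = TRUE)

    print(c(start_only, end_only, b_start, b_end))
    # [1] "piano X"   "nowhere X" "piano X"   "nowhere X"
    ```

    In these patterns \b ends up doing the same job as \< or \>, but only because of where it sits relative to "no"; the token on its own does not say which edge is meant.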

    One more thing to consider (reference):

    POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
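
    As a small illustration of the advice quoted above, the sketch below marks every word boundary with gsub() and perl = TRUE. The input string is an arbitrary ASCII example, chosen to stay clear of the non-ASCII caveat.

    ```r
    x <- "no where"

    # Repeated zero-width matches: insert a marker at every word boundary.
    # perl = TRUE is used because the quoted note warns that the default
    # POSIX mode does not handle repeated word-boundary matches correctly.
    marked <- gsub("\\b", "|", x, perl = TRUE)

    print(marked)
    # [1] "|no| |where|"
    ```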
