Why is this end of line (\\b) not recognised as a word boundary in stringr/ICU and Perl

Question

Using stringr, I tried to detect a € sign at the end of a string as follows:

str_detect("my text €", "€\\b") # FALSE

Why is this not working? It is working in the following cases:

str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails with perl=TRUE:

grepl("€\\b", "2009in €", perl=TRUE) # FALSE

So what is wrong with the €\\b regex? The regex €$ works in all cases.

Answer 1:

When you use base R regex functions without perl=TRUE, the TRE regex flavor is used.

It appears that a TRE word boundary:

- when used after a non-word character, matches the end-of-string position, and
- when used before a non-word character, matches the start-of-string position.

See the R tests:

> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"

This differs from the word boundary in the PCRE and ICU regex flavors. There, a word boundary before a non-word character only matches when that character is preceded by a word character, so the start of the string does not count; likewise, a word boundary after a non-word character requires a word character right after it, so the end of the string does not count. Since € is a non-word character, €\\b cannot match at the end of the string in those flavors:

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
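
As a rough illustration of these three rules, the sketch below lists every offset at which \b matches. It uses Python's re module as a stand-in (an assumption of this example, not part of the original answer), since its \b is also defined in terms of word and non-word characters. The helper name boundary_positions is hypothetical.

```python
import re

def boundary_positions(text):
    # Collect the offsets of all zero-width \b matches in text.
    return [m.start() for m in re.finditer(r"\b", text)]

# Offsets: m=0, y=1, space=2, t=3, e=4, x=5, t=6, space=7, last char=8
euro = boundary_positions("my text €")
letter = boundary_positions("my text a")

print(euro)    # no boundary at offsets 8 or 9: € is not a word character
print(letter)  # boundaries at 8 and 9: "a" is a word character
```

With a trailing letter there is a boundary right after the last character; with a trailing € there is none, which is why €\b finds nothing to match.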

Answer 2:

\b

is equivalent to

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

which is to say, it matches:

- between a word char and a non-word char,
- between a word char and the start of the string, and
- between a word char and the end of the string.
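
The equivalence can be spot-checked by comparing match offsets on a few strings. This is only a sketch in Python's re module (chosen here as an assumption, because it supports both \b and fixed-width look-behinds); the helper positions and the sample list are made up for illustration, and a handful of samples is not a proof.

```python
import re

WORD_BOUNDARY = re.compile(r"\b")
# The look-around expansion quoted above, copied verbatim.
EXPANDED = re.compile(r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))")

def positions(pattern, text):
    # Offsets of every zero-width match of pattern in text.
    return [m.start() for m in pattern.finditer(text)]

samples = ["my text €", "my text a", "2009in €", ") 2009in )", ""]
same = all(
    positions(WORD_BOUNDARY, s) == positions(EXPANDED, s) for s in samples
)
print(same)
```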

€ is a symbol, and symbols aren't word characters.

$ uniprops €
U+20AC <€> \N{EURO SIGN}
    \pS \p{Sc}
    All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode

If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and non-space (treating the start and end as a space).

(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
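
A minimal sketch of how this pattern behaves for the € case, again using Python's re module as an assumed stand-in for a flavor with look-arounds. The helper name ends_token and the test strings other than "my text €" are hypothetical.

```python
import re

# Space/non-space boundary from the answer above, copied verbatim.
SPACE_BOUNDARY = r"(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))"

def ends_token(symbol, text):
    # True if symbol occurs in text directly before a space/non-space boundary.
    return re.search(re.escape(symbol) + SPACE_BOUNDARY, text) is not None

at_end = ends_token("€", "my text €")        # € followed by end of string
before_space = ends_token("€", "5 € total")  # € followed by a space
inside_token = ends_token("€", "€uro")       # € followed by a non-space
word_boundary = re.search(r"€\b", "my text €") is not None

print(at_end, before_space, inside_token, word_boundary)
```

Because the boundary is defined by whitespace rather than by word characters, a symbol such as € at the end of the string is treated like any other non-space character.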

Source: https://stackoverflow.com/questions/41174959/why-does-is-this-end-of-line-b-not-recognised-as-word-boundary-in-stringr-ic

Tags

regex

pcre

stringr