Why is this end of line (\\b) not recognised as a word boundary in stringr/ICU and Perl?

Submitted by 醉酒当歌 on 2019-12-19 17:45:50

Question


Using stringr, I tried to detect a € sign at the end of a string as follows:

str_detect("my text €", "€\\b") # FALSE

Why is this not working? It is working in the following cases:

str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails in base R with perl=TRUE:

grepl("€\\b", "2009in €", perl=TRUE) # FALSE

So what is wrong with the €\\b regex? The regex €$ works in all cases.


Answer 1:


When you use base R regex functions without perl=TRUE, the TRE regex flavor is used.

It appears that a TRE word boundary:

  • matches the end-of-string position when used after a non-word character, and
  • matches the start-of-string position when used before a non-word character.

See the R tests:

> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"
> 

This is not how a word boundary usually behaves. In the PCRE and ICU regex flavors, a word boundary before a non-word character only matches when that character is preceded by a word character, which rules out the start-of-string position. Likewise, a word boundary after a non-word character requires a word character right after it, which rules out the end-of-string position:

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
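
To see these rules in action, here is a small sketch in base R with perl = TRUE (PCRE). It reruns the two TRE tests shown earlier and adds a few ASCII-only checks; the extra sample strings are illustrative assumptions, not part of the original tests.

```r
# Sketch: the same replacements as in the TRE tests, but with PCRE (perl = TRUE).
# Neither ")" touches a word character, so PCRE finds no boundary next to them.
x <- ") 2009in )"
pcre_before <- gsub("\\b\\)", "HERE", x, perl = TRUE)
pcre_after  <- gsub("\\)\\b", "HERE", x, perl = TRUE)
print(pcre_before)  # the string comes back unchanged
print(pcre_after)   # the string comes back unchanged

# The three qualifying positions, plus one non-boundary (illustrative inputs):
at_start <- grepl("\\bfoo", "foo)", perl = TRUE)  # before the first char, a word char
at_end   <- grepl("foo\\b", "(foo", perl = TRUE)  # after the last char, a word char
between  <- grepl("\\)\\b", "a)b", perl = TRUE)   # ")" directly followed by "b"
no_bound <- grepl("\\)\\b", "a) b", perl = TRUE)  # ")" followed by a space
print(c(at_start, at_end, between, no_bound))     # TRUE TRUE TRUE FALSE
```

With TRE (no perl = TRUE) the first two calls behave differently, as the console output above shows.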




Answer 2:


\b

is equivalent to

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

which is to say it matches

  • between a word char and a non-word char,
  • between a word char and the start of the string, and
  • between a word char and the end of the string.
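
One way to check this equivalence is to replace every match of each pattern with a marker and compare the results. The following is a minimal base R sketch using perl = TRUE, since the look-arounds need PCRE; the sample strings and the "|" marker are arbitrary choices for illustration.

```r
# Sketch: \b and its look-around expansion should mark the same positions (PCRE).
word_boundary <- "\\b"
expanded      <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"

# Illustrative inputs, including an empty string and strings with no word chars at the edges.
samples <- c("foo bar", ") 2009in )", "a)b", "")

# Insert "|" at every zero-width match of each pattern.
marked_b        <- gsub(word_boundary, "|", samples, perl = TRUE)
marked_expanded <- gsub(expanded, "|", samples, perl = TRUE)

print(marked_b)
print(identical(marked_b, marked_expanded))
```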

€ is a symbol, and symbols aren't word characters.

$ uniprops €
U+20AC <€> \N{EURO SIGN}
    \pS \p{Sc}
    All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
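
The same conclusion can be checked from R. This is a small sketch with perl = TRUE; the euro sign is written as the escape \u20ac so the snippet does not depend on the file's encoding, and the "uro" input is a made-up example.

```r
# Sketch: under PCRE the euro sign is not a word character, so there is no
# word boundary between it and the end of the string.
euro <- "\u20ac"

is_word_char <- grepl("\\w", euro, perl = TRUE)
print(is_word_char)             # FALSE

# The pattern from the question: euro sign followed by \b.
pat <- paste0(euro, "\\b")
at_end      <- grepl(pat, paste0("my text ", euro), perl = TRUE)
before_word <- grepl(pat, paste0("my text ", euro, "uro"), perl = TRUE)
print(c(at_end, before_word))   # FALSE TRUE
```

The second result is TRUE because a word character follows the euro sign there, which creates a boundary between the two.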

If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and a non-space (treating the start and end of the string as a space).

(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
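
Applied to the original question, that could look like the sketch below. It uses base R with perl = TRUE, and the variable names and test strings are illustrative. ICU also supports look-arounds, so the same pattern should carry over to stringr, but this sketch only covers the PCRE case.

```r
# Sketch: a space/non-space boundary instead of \b (PCRE, perl = TRUE).
euro <- "\u20ac"
space_boundary <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"
pat <- paste0(euro, space_boundary)

at_end     <- grepl(pat, paste0("my text ", euro), perl = TRUE)         # euro sign at end of string
mid_string <- grepl(pat, paste0("5 ", euro, " each"), perl = TRUE)      # euro sign followed by a space
glued      <- grepl(pat, paste0("my text ", euro, "uro"), perl = TRUE)  # euro sign followed by a letter
print(c(at_end, mid_string, glued))   # TRUE TRUE FALSE
```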


Source: https://stackoverflow.com/questions/41174959/why-does-is-this-end-of-line-b-not-recognised-as-word-boundary-in-stringr-ic
