From my previous questions "Why under locale-pragma word characters do not match?" and "How to change nested quotes" I learnt that when dealing with UTF-8 data you can't trust
You should be using negative lookarounds:
(?<!\p{Word})(\p{Word}+)(?!\p{Word})
The positive lookarounds fail at the start or end of the string because they require a non-word character to be present. The negative lookarounds work in both cases.
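To make the difference concrete, here is a minimal sketch that runs both variants over the same string. The sample text and variable names are made up for illustration, and the positive version, written with \P{Word}, is an assumption about what such a pattern would look like.

```perl
use strict;
use warnings;
use utf8;   # the source below contains non-ASCII literals

my $text = "été comme ça";

# Negative lookarounds: "no word character on either side".
my @negative = $text =~ /(?<!\p{Word})(\p{Word}+)(?!\p{Word})/g;

# Positive lookarounds: "a non-word character on both sides".
# This cannot succeed at the very start or very end of the string.
my @positive = $text =~ /(?<=\P{Word})(\p{Word}+)(?=\P{Word})/g;

binmode STDOUT, ':encoding(UTF-8)';
print "negative: @negative\n";   # all three words
print "positive: @positive\n";   # only the middle word
```

The first and last words are lost in the positive version because there is no character before the first one or after the last one for the lookaround to consume.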
Since the character after the position of the \b is either some punctuation or " (to be safe, please double-check that \p{Word} does not match any of them), it falls into the case \b\W. Therefore, we can emulate \b with:
(?<=\p{Word})
I am not familiar with Perl, but from what I tested here, it seems that \w (and \b) also work nicely when the encoding is set to UTF-8.
$sentence =~ s/
"(
[\w\.]+?
.*?\b[\.,?!»]*?
)"
/«$1»/xg;
If you move up to Perl 5.14 or above, you can set the character set to Unicode with the /u flag.
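As a small illustration of what the flag changes, the sketch below matches \w against é stored as a single native byte. The test string is an assumption chosen to trigger the difference; strings that Perl already holds internally as UTF-8 get Unicode rules for \w even without the flag.

```perl
use strict;
use warnings;

# "\xE9" is é (U+00E9) held as one native byte, not as upgraded UTF-8.
my $e_acute = "\xE9";

# Default rules: a native byte above 0x7F is not treated as a word character.
my $default_rules = $e_acute =~ /^\w$/  ? 1 : 0;

# /u forces Unicode rules, so U+00E9 counts as a word character.
my $unicode_rules = $e_acute =~ /^\w$/u ? 1 : 0;

print "default: $default_rules, /u: $unicode_rules\n";
```

The same modifier can be appended to a substitution, for example s/.../.../xgu.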
You can use this general strategy to construct a boundary corresponding to a character class (just as the definition of the \b word boundary is based on the definition of \w).
Let C be the character class. We would like to define a boundary that is based on C.
The construction below emulates a boundary in front when you know the current character belongs to the character class C (equivalent to \b\w):
(?<!C)C
Or behind (equivalent to \w\b):
C(?!C)
Why negative look-around? Because positive look-around (with the complementary character class) will also assert that there must be a character ahead/behind (assert width ahead/behind at least 1). Negative look-around will allow for the case of beginning/ending of the string without writing a cumbersome regex.
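A minimal sketch of the two constructions in Perl, assuming a hypothetical class C made of ASCII letters plus the apostrophe (so that "don't" counts as a single word). The variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical class C: ASCII letters plus the apostrophe.
my $C = qr/[A-Za-z']/;

my $text = "don't stop";

my (@starts, @ends, @builtin);
push @starts,  $-[0] while $text =~ /(?<!$C)$C/g;   # first character of each run of C
push @ends,    $-[0] while $text =~ /$C(?!$C)/g;    # last character of each run of C
push @builtin, $-[0] while $text =~ /\b\w/g;        # built-in word boundary, for comparison

print "starts:  @starts\n";    # 0 6
print "ends:    @ends\n";      # 4 9
print "builtin: @builtin\n";   # 0 4 6
```

The built-in \b\w reports an extra start at offset 4 because the apostrophe is not a \w character, whereas the boundary built from C treats it as part of the word. Both runs are also found at the very start and very end of the string, which is what the negative look-arounds buy.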
For \B\w emulation:
(?<=C)C
and similarly \w\B:
C(?=C)
\B is the direct opposite of \b; therefore, we can just flip the positive/negative look-around to emulate the effect. It also makes sense: a non-boundary can only be formed when there are more characters ahead/behind.
Other emulations (let c be the complement character class of C):
\b\W: (?<=C)c
\W\b: c(?=C)
\B\W: (?<!C)c
\W\B: c(?!C)
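One way to sanity-check these four emulations is to set C = \w and c = \W and compare match offsets against the built-in forms. The helper offsets() and the sample text are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Return the start offset of every match of $re in $text, space-separated.
sub offsets {
    my ($re, $text) = @_;
    my @found;
    push @found, $-[0] while $text =~ /$re/g;
    return "@found";
}

# Sample text that starts and ends with a non-word character,
# so the start-of-string and end-of-string cases are exercised too.
my $text = "-ab, cd!";

# Built-in form paired with its emulation (C = \w, c = \W).
my @pairs = (
    [ qr/\b\W/, qr/(?<=\w)\W/ ],
    [ qr/\W\b/, qr/\W(?=\w)/  ],
    [ qr/\B\W/, qr/(?<!\w)\W/ ],
    [ qr/\W\B/, qr/\W(?!\w)/  ],
);

for my $pair (@pairs) {
    my ($builtin, $emulated) = @$pair;
    printf "%-12s %-6s | %-18s %s\n",
        $builtin,  offsets($builtin,  $text),
        $emulated, offsets($emulated, $text);
}
```

Each row should show the same offsets on both sides. Swapping \w and \W for another class and its complement gives the corresponding boundary for that class.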
For the emulation of a standalone boundary (equivalent to \b):
(?:(?<!C)(?=C)|(?<=C)(?!C))
And standalone non-boundary (equivalent to \B):
(?:(?<!C)(?!C)|(?<=C)(?=C))
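The two standalone forms can be wrapped in small helpers that build the pattern for any class. This is only a sketch: boundary_for(), non_boundary_for() and positions() are hypothetical names, and the class passed in must match exactly one character so that the look-behind stays fixed-width.

```perl
use strict;
use warnings;

# Build a zero-width boundary / non-boundary pattern for character class $C.
sub boundary_for {
    my ($C) = @_;
    return qr/(?:(?<!$C)(?=$C)|(?<=$C)(?!$C))/;
}

sub non_boundary_for {
    my ($C) = @_;
    return qr/(?:(?<!$C)(?!$C)|(?<=$C)(?=$C))/;
}

# Return every offset at which the zero-width pattern $re matches in $text.
sub positions {
    my ($re, $text) = @_;
    my @found;
    push @found, $-[0] while $text =~ /$re/g;
    return "@found";
}

my $text = "ab, cd!";

# With C = \w the emulations should agree with the built-in assertions.
print "emulated \\b: ", positions(boundary_for(qr/\w/),     $text), "\n";
print "built-in \\b: ", positions(qr/\b/,                   $text), "\n";
print "emulated \\B: ", positions(non_boundary_for(qr/\w/), $text), "\n";
print "built-in \\B: ", positions(qr/\B/,                   $text), "\n";

# With a wider class, the apostrophe no longer splits the word.
print "custom:      ", positions(boundary_for(qr/[\w']/), "don't"), "\n";
```

For C = \w the emulated and built-in lines list the same offsets; for the class [\w'] the only boundaries in "don't" are at the start and the end of the string.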