How to emulate a word boundary when using Unicode character properties?

后悔当初 2020-12-19 03:54

From my previous questions "Why under locale-pragma word characters do not match?" and "How to change nested quotes", I learnt that when dealing with UTF-8 data you can't trust

2 Answers
  • 2020-12-19 04:03

    You should be using negative lookarounds:

    (?<!\p{Word})(\p{Word}+)(?!\p{Word})
    

    The positive lookarounds fail at the start or end of the string because they require a non-word character to be present. The negative lookarounds work in both cases.
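
To see the difference on an actual string, here is a small sketch. The positive-lookaround pattern is my guess at what such an attempt would look like (it is an assumption, not taken from the question), and the sample text is made up for illustration.

```perl
use strict;
use warnings;

# Sample text "été, naïve café", written with escapes so the file encoding
# does not matter. The \p{Word} property in the patterns forces Unicode rules.
my $text = "\x{e9}t\x{e9}, na\x{ef}ve caf\x{e9}";

# Assumed positive-lookaround version: needs a real non-word character on
# both sides, so words touching either end of the string are skipped.
my @positive = $text =~ /(?<=\P{Word})(\p{Word}+)(?=\P{Word})/g;

# Negative lookarounds only require "no word character here", which is also
# true at the very start and the very end of the string.
my @negative = $text =~ /(?<!\p{Word})(\p{Word}+)(?!\p{Word})/g;

binmode STDOUT, ':encoding(UTF-8)';
print "positive: @positive\n";
print "negative: @negative\n";
```

With this input the positive version should only pick up the middle word, while the negative version should return all three.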

  • 2020-12-19 04:05

    Since the character after the position of the \b is either punctuation or a double quote (to be safe, please double-check that \p{Word} does not match any of them), it falls into the case \b\W. Therefore, we can emulate \b with:

    (?<=\p{Word})
    

    I am not familiar with Perl, but from what I tested, it seems that \w (and \b) also work nicely when the encoding is set to UTF-8.

    $sentence =~ s/
      "(
        [\w\.]+?
        .*?\b[\.,?!»]*?
      )"
      /«$1»/xg;
    

    If you move up to Perl 5.14 or above, you can switch the regex to Unicode rules with the /u flag.
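
A quick way to check what /u changes, as a sketch only: it assumes Perl 5.14+ and that neither `use locale` nor the `unicode_strings` feature is active. The variable names are mine.

```perl
use strict;
use warnings;

# é as a single native character below 256, without the internal UTF-8 flag.
my $e_acute = "\x{e9}";

# Default rules: characters 128-255 in such a string are not treated as \w.
my $default_rules = $e_acute =~ /^\w$/  ? 1 : 0;

# /u forces Unicode rules, so é counts as a word character.
my $unicode_rules = $e_acute =~ /^\w$/u ? 1 : 0;

print "default: $default_rules, with /u: $unicode_rules\n";
```

Under those assumptions I would expect the default match to fail and the /u match to succeed.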


    You can use this general strategy to construct a boundary for any character class, in the same way that the definition of the word boundary \b is based on the definition of \w.

    Let C be the character class. We would like to define a boundary that is based on the character class C.

    The construction below emulates a boundary in front of the current character when you know that character belongs to the class C (equivalent to \b\w):

    (?<!C)C
    

    Or behind (equivalent to \w\b):

    C(?!C)
    

    Why negative look-around? Because a positive look-around (with the complementary character class) would also assert that there is a character ahead/behind (it needs at least one character to match). A negative look-around also succeeds at the beginning/end of the string, without writing a cumbersome regex.
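
As a sketch of the two constructions with a class other than \w: the class below (Unicode letters plus the hyphen) and the sample text are hypothetical, picked only to show that the boundary follows whatever C you choose.

```perl
use strict;
use warnings;

# Hypothetical class C: Unicode letters plus hyphen, so "well-known" is one run.
my $C = qr/[\p{L}\-]/;

my $run_start = qr/(?<!$C)$C/;   # plays the role of \b\w for class C
my $run_end   = qr/$C(?!$C)/;    # plays the role of \w\b for class C

my $text = "well-known co-op, x";

# Whole runs of C: open with the front boundary, close with the back one.
my @tokens = $text =~ /(?<!$C)($C+)(?!$C)/g;

# 0-based offsets where a run begins, and (exclusive) offsets where it ends.
my (@starts, @ends);
push @starts, $-[0] while $text =~ /$run_start/g;
push @ends,   $+[0] while $text =~ /$run_end/g;

print "tokens: @tokens\n";
print "starts: @starts\n";
print "ends:   @ends\n";
```

Note that the last run, "x", sits at the end of the string and is still found, which is the point of using the negative look-ahead.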


    For \B\w emulation:

    (?<=C)C
    

    and similarly \w\B:

    C(?=C)
    

    \B is the direct opposite of \b, so we can simply flip the positive/negative look-around to emulate it. This also makes sense: a non-boundary next to a character of C can only be formed when there is another character of C ahead/behind.


    Other emulations (let c be the complement character class of C):

    • \b\W: (?<=C)c
    • \W\b: c(?=C)
    • \B\W: (?<!C)c
    • \W\B: c(?!C)

    For the emulation of a standalone boundary (equivalent to \b):

    (?:(?<!C)(?=C)|(?<=C)(?!C))
    

    And standalone non-boundary (equivalent to \B):

    (?:(?<!C)(?!C)|(?<=C)(?=C))
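
To sanity-check the two standalone constructions, one option is to take C = \p{Word} and compare the matched positions against the built-in \b under /u. This is a sketch: the `positions` helper and the sample string are my own, and it assumes Perl 5.14+ for the /u flag.

```perl
use v5.14;      # needed for the /u flag; also turns on strict
use warnings;

my $C = qr/\p{Word}/;

my $boundary     = qr/(?:(?<!$C)(?=$C)|(?<=$C)(?!$C))/;   # emulates \b
my $non_boundary = qr/(?:(?<!$C)(?!$C)|(?<=$C)(?=$C))/;   # emulates \B

# Hypothetical helper: every offset at which a zero-width pattern matches.
sub positions {
    my ($re, $s) = @_;
    my @pos;
    push @pos, $-[0] while $s =~ /$re/g;
    return @pos;
}

# The text is: «café» ok   (escapes keep the file encoding out of the picture)
my $text = "\x{ab}caf\x{e9}\x{bb} ok";

my @emulated     = positions($boundary, $text);
my @builtin      = positions(qr/\b/u, $text);
my @emulated_non = positions($non_boundary, $text);

print "emulated \\b: @emulated\n";
print "built-in \\b: @builtin\n";
print "emulated \\B: @emulated_non\n";
```

The first two lists should agree, and the \B list should contain exactly the remaining offsets of the string.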
    