Python: using regex and tokens with accented chars (negative lookbehind)

Question

I need to detect capitalized words in Spanish, but only when they are not preceded by a given token, and that token may contain Unicode characters. (I'm using Python 2.7.12 on Linux.)

This works fine with a non-accented token (e.g. guion:):

>>> import regex
>>> s = u"guion: El computador. Ángel."
>>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guion: El computador. **Ángel**.

But the same pattern fails when the token contains an accented character (e.g. guión:):

>>> s = u"guión: El computador. Ángel."
>>> p = regex.compile( ur'(?<!guión:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guión: **El** computador. **Ángel**.

The expected outcome would be:

guión: El computador. **Ángel**.

On regex101 the pattern works just fine. I tested it with the 'pcre (php)' flavor rather than the 'python' flavor, because for some reason the former gives results closer to those of the regex package on the Python command line.

Is it due to the Python version I'm using (2.7.12 instead of Python 3)? Most likely I am misunderstanding something. Thanks in advance for any pointers.
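For comparison with Python 3, here is a minimal sketch, not a diagnosis of the Python 2 behaviour above. It assumes Python 3 and only the standard-library re module, which has no \p{Lu} or \p{Ll}, so the upper/lowercase check is moved into a replacement callback instead. The names WORD and mark_capitalized are illustrative and do not come from the original code.

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3 using only the standard library (assumption: the
# third-party regex package is not available, so \p{Lu}/\p{Ll} cannot be used).
import re

# Same negative lookbehind as in the question: skip a word that directly
# follows the token "guión:" plus one whitespace character.
# In Python 3, \w and \b are Unicode-aware for str patterns.
WORD = re.compile(r"(?<!guión:\s)\b(\w+)\b")


def mark_capitalized(text):
    """Wrap capitalized words in ** **, except right after the token."""
    def repl(match):
        word = match.group(1)
        # Approximates [\p{Lu}][\p{Ll}]+ : one uppercase character
        # followed by a lowercase remainder.
        if len(word) > 1 and word[0].isupper() and word[1:].islower():
            return "**" + word + "**"
        return word
    return WORD.sub(repl, text)


print(mark_capitalized("guión: El computador. Ángel."))
# guión: El computador. **Ángel**.
```

The callback keeps the pattern itself simple; str.isupper() and str.islower() only approximate the Unicode property classes, so words containing digits or underscores may be treated differently than with \p{Lu} and \p{Ll}.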

After plenty of bugs and weird outcomes, I've come to realize that:

The regex package is the way to go instead of re, because of its better Unicode support (for instance, it can distinguish uppercase from lowercase Unicode characters).
The regex.U flag must be set. (regex.X only allows whitespace and comments in the pattern, for the sake of clarity.)
In Python 2, u'' Unicode strings and r'' raw strings can be combined as ur'' (the ur prefix is not valid in Python 3).
\p{Lu} and \p{Ll} match Unicode uppercase and lowercase letters, respectively.
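To make the last point concrete: Lu and Ll are Unicode general categories, and the standard-library unicodedata module can report them for any character. The sketch below is only an illustration of what [\p{Lu}][\p{Ll}]+ tests for; the helper name is_capitalized is hypothetical, and the code assumes Python 3 and precomposed characters such as U+00C1.

```python
import unicodedata


def is_capitalized(word):
    """Return True if word is one Lu character followed by one or more Ll."""
    categories = [unicodedata.category(ch) for ch in word]
    return (
        len(categories) >= 2
        and categories[0] == "Lu"
        and all(cat == "Ll" for cat in categories[1:])
    )


# Categories of a few characters from the example strings.
for ch in "\u00c1\u00f3E:":
    print(repr(ch), unicodedata.category(ch))

print(is_capitalized("\u00c1ngel"))   # True
print(is_capitalized("computador"))   # False
```

A string that stores the accent as a separate combining mark would fail this check, since combining marks belong to category Mn rather than Lu or Ll.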

Source: https://stackoverflow.com/questions/50306651/python-using-regex-and-tokens-with-accented-chars-negative-lookbehind

Tags

python

regex

python-unicode