Question
I need to detect capitalized words in Spanish, but only when they are not preceded by a given token, which can contain Unicode characters. (I'm using Python 2.7.12 on Linux.)
This works OK with a non-accented token (e.g. guion:):
>>> import regex
>>> s = u"guion: El computador. Ángel."
>>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guion: El computador. **Ángel**.
But the same logic fails to recognize an accented token (e.g. guión:):
>>> s = u"guión: El computador. Ángel."
>>> p = regex.compile( ur'(?<!guión:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guión: **El** computador. **Ángel**.
The expected outcome would be:
guión: El computador. **Ángel**.
In regex101 the pattern works just fine (using the 'pcre (php)' flavor instead of the 'python' flavor, since for some reason the former gives results closer to those of the regex package on the Python command line).
Is it due to the Python version I'm using (2.7.12 instead of Python 3)? Most likely I am misunderstanding something. Thanks in advance for any pointers.
After plenty of bugs and weird outcomes, I've come to realize that:
- The regex package is the way to go, instead of re, due to better Unicode support (for instance, it provides differentiation of upper- and lowercase Unicode chars).
- The regex.U flag must be set. (regex.X just allows spaces and comments for the sake of clarity.)
- u'' unicode strings and r'' raw strings can be combined at the same time: ur''
- \p{Lu} and \p{Ll} match Unicode uppercase and lowercase chars, respectively.
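
The points above can be sketched for Python 3, where every str literal is already Unicode, so no u'' or ur'' prefix is needed and str patterns are matched as Unicode by default. This is only a minimal sketch under stated assumptions, not the original code: the helper name mark_capitalized is hypothetical, the leading \b is an addition, and the fallback to the standard re module (used only when the third-party regex package is not installed) merely approximates \p{Lu} and \p{Ll} with str.isupper() and str.islower().

```python
import re

try:
    import regex  # third-party package used in the question
except ImportError:
    regex = None  # fall back to the standard library below


def mark_capitalized(text):
    """Wrap capitalized words in ** unless they directly follow 'guión: '.

    Hypothetical helper for illustration only.
    """
    if regex is not None:
        pattern = regex.compile(
            r"""(?<!guión:\s)      # not directly after the accented token
                \b                 # start of a word (added in this sketch)
                ( \p{Lu} \p{Ll}+ ) # one uppercase letter, then lowercase letters
                \b                 # end of the word
            """,
            regex.X,  # verbose mode: whitespace and comments are ignored
        )
        return pattern.sub(r"**\1**", text)

    # Standard-library fallback: re has no \p{Lu} / \p{Ll}, so the case
    # check is done in a replacement callback (an approximation).
    pattern = re.compile(r"(?<!guión:\s)\b(\w+)\b")

    def wrap(match):
        word = match.group(1)
        is_capitalized = (
            len(word) > 1
            and word[0].isupper()
            and all(ch.islower() for ch in word[1:])
        )
        return "**%s**" % word if is_capitalized else word

    return pattern.sub(wrap, text)


print(mark_capitalized("guión: El computador. Ángel."))
```

With the sample sentence from the question, this should print "guión: El computador. **Ángel**.", which is the expected outcome stated above. It does not explain why the Python 2.7.12 session behaves differently.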
Source: https://stackoverflow.com/questions/50306651/python-using-regex-and-tokens-with-accented-chars-negative-lookbehind