Python: using regex and tokens with accented chars (negative lookbehind)

ⅰ亾dé卋堺 submitted on 2019-12-11 05:08:32

Question


I need to detect capitalized words in Spanish, but only when they are not preceded by a token, which can contain Unicode characters. (I'm using Python 2.7.12 on Linux.)

This works fine with a non-Unicode token (e.g. guion:):

>>> import regex
>>> s = u"guion: El computador. Ángel."
>>> p = regex.compile( r'(?<!guion:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guion: El computador. **Ángel**.

But the same logic fails when the token is accented (e.g. guión:):

>>> s = u"guión: El computador. Ángel."
>>> p = regex.compile( ur'(?<!guión:\s) ( [\p{Lu}] [\p{Ll}]+ \b)' , regex.U | regex.X)
>>> print p.sub( r"**\1**", s)
guión: **El** computador. **Ángel**.

The expected outcome would be:

guión: El computador. **Ángel**.

On regex101 the pattern works just fine (with the 'pcre (php)' flavor rather than the 'python' flavor, since for some reason the former seems to give results closer to those of the regex package run from the command line in Python).

Is this due to the Python version I'm using (2.7.12 instead of Python 3)? Most likely I am misunderstanding something. Thanks in advance for any pointers.
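One thing that may be worth ruling out (this is an assumption, not something the question confirms) is a mismatch between the composed and decomposed forms of the accented letter in the pattern and in the input string. The two forms look identical on screen but compare as different strings. A minimal sketch using only the standard library's unicodedata module:

```python
import unicodedata

# The same visible token written in two different ways.
composed = u"gui\u00f3n:"      # 'ó' as a single code point (NFC form)
decomposed = u"guio\u0301n:"   # 'o' followed by a combining acute accent (NFD form)

# They render identically but are not equal as strings.
print(composed == decomposed)

# Normalizing both sides to the same form makes them comparable.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

If the pattern and the text come from different sources (terminal, file, clipboard), normalizing both to NFC before compiling and matching removes this variable.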

After plenty of bugs and weird outcomes, I've come to realize that:

  • The regex package is the way to go instead of re, because of its better Unicode support (for instance, it can distinguish uppercase from lowercase Unicode characters).

  • The regex.U flag must be set. (regex.X just allows whitespace and comments inside the pattern, for the sake of clarity.)

  • u'' unicode strings and r'' raw strings can be combined as ur'' (Python 2 only; the ur prefix is a syntax error in Python 3).

  • \p{Lu} and \p{Ll} match Unicode uppercase and lowercase letters, respectively.
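For comparison, here is a rough Python 3 sketch that sidesteps the lookbehind entirely. It is a different technique from the pattern above, not a fix for it: it uses the standard re module with a replacement callback that inspects the text before each word. The names TOKEN, WORD, and mark_capitalized are hypothetical, and unlike the original pattern it only considers whole words.

```python
import re

# Hypothetical token for illustration; any Unicode string works.
TOKEN = "guión:"

# In Python 3, str patterns are Unicode-aware by default, so this
# matches runs of letters, including accented ones.
WORD = re.compile(r"[^\W\d_]+")


def mark_capitalized(text, token=TOKEN):
    """Wrap capitalized words in ** unless they directly follow
    `token` plus exactly one whitespace character."""

    def repl(match):
        word = match.group(0)
        # One uppercase letter followed by one or more lowercase letters.
        if not (len(word) > 1 and word[0].isupper() and word[1:].islower()):
            return word
        prefix = text[:match.start()]
        # Emulates the negative lookbehind (?<!token\s).
        follows_token = prefix[-1:].isspace() and prefix[:-1].endswith(token)
        return word if follows_token else "**" + word + "**"

    return WORD.sub(repl, text)


print(mark_capitalized("guión: El computador. Ángel."))
```

Doing the "preceded by token" check in plain Python keeps the token out of the regular expression, so any string can serve as the token without escaping or pattern-encoding concerns.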

Source: https://stackoverflow.com/questions/50306651/python-using-regex-and-tokens-with-accented-chars-negative-lookbehind
