Matching Unicode characters in Python regular expressions

Submitted by 半城伤御伤魂 on 2019-11-26 09:39:32

Question


I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this has already been answered, but I couldn't get anything proposed there to work.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or anything else outside ASCII):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters such as ø, æ and å? I'd like to be able to match those characters as well, in both the tag group above and the filename group.


Answer 1:


You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

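This is in Python 2. In Python 3 the u prefix is unnecessary because all strings are Unicode (it is accepted again from Python 3.3 on, but has no effect), and re.UNICODE is already the default for str patterns.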
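
For comparison, here is a minimal Python 3 sketch of the same match. It reuses the pattern from the question unchanged; the names PATTERN, groups and ascii_match are just illustrative. It also shows that passing re.ASCII brings back the ASCII-only behaviour the question ran into.

```python
import re

# Same pattern as in the question. In Python 3, \w is Unicode-aware
# for str patterns, so no flag and no u prefix are needed.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')

m = PATTERN.match('/by_tag/påske/øyfjell.jpg')
groups = m.groupdict()
print(groups)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the same path no longer matches.
ascii_match = re.match(PATTERN.pattern, '/by_tag/påske/øyfjell.jpg', re.ASCII)
print(ascii_match)
```

So on Python 3 the original code from the question should work as written, apart from print being a function.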




Answer 2:


You need the UNICODE flag:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)
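
Note that the flag only helps when the subject is real text rather than encoded bytes. A minimal Python 3 sketch, assuming the path arrives as UTF-8 bytes (raw_path and the other variable names are made up for illustration):

```python
import re

PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'

# Hypothetical input: the path as UTF-8 encoded bytes, e.g. read from a socket.
raw_path = '/by_tag/påske/øyfjell.jpg'.encode('utf-8')

# A bytes pattern only treats ASCII letters, digits and underscore as \w,
# so the encoded non-ASCII characters make the match fail.
bytes_match = re.match(PATTERN.encode('ascii'), raw_path)
print(bytes_match)

# Decoding first yields a str, for which \w is Unicode-aware by default.
text_match = re.match(PATTERN, raw_path.decode('utf-8'))
print(text_match.groupdict())
```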



Answer 3:


In Python 2, you need the re.UNICODE flag and the unicode string constructor:

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好,世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好,世界-=+","utf-8"),flags=re.UNICODE)
,./___,___-=+

(In the last two cases, the comma between the words is the fullwidth Chinese comma, U+FF0C, which \w does not match.)



Source: https://stackoverflow.com/questions/5028717/matching-unicode-characters-in-python-regular-expressions
