Matching Unicode characters in Python regular expressions

执笔经年 2020-11-29 05:55

I have read through the other questions on Stack Overflow but am still no closer. Sorry if this has already been answered, but I couldn't get anything proposed there to work.

         


        
3 answers
  • 2020-11-29 06:19

    In Python 2, you need the re.UNICODE flag and a unicode string (built here with the unicode() constructor):

    >>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
    u',./___-=+'
    >>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
    u',./___-=+'
    >>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
    u',./___-=+'
    >>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
    u',./___-=+'
    >>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
    u',./___\uff0c___-=+'
    >>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
    ,./___，___-=+
    

    (In the last two examples, the comma between the Chinese words is the fullwidth Chinese comma, U+FF0C, which \w does not match.)
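
For comparison, here is a minimal Python 3 sketch of the same substitution. It assumes Python 3, where str patterns treat \w as Unicode-aware by default, so neither the unicode() constructor nor re.UNICODE is needed. The names mask_words and samples are made up for illustration and are not from the answer above.

```python
import re

# Same inputs as the Python 2 session above; the last one contains
# the fullwidth comma U+FF0C between the two Chinese words.
samples = [
    ",./hello-=+",
    ",./cześć-=+",
    ",./привет-=+",
    ",./你好-=+",
    ",./你好\uff0c世界-=+",
]


def mask_words(text):
    # For str patterns in Python 3, \w matches Unicode letters and digits
    # without any extra flag.
    return re.sub(r"\w+", "___", text)


results = [mask_words(s) for s in samples]
for r in results:
    print(r)

# re.ASCII restores the ASCII-only meaning of \w, so only "cze" is
# replaced and the accented letters are left in place.
ascii_only = re.sub(r"\w+", "___", ",./cześć-=+", flags=re.ASCII)
print(ascii_only)
```

The re.ASCII call at the end is only there to show the contrast: with it, \w goes back to matching [a-zA-Z0-9_] and the accented letters are no longer treated as word characters.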

  • 2020-11-29 06:32

    You need the re.UNICODE flag (in Python 2, the input must also be a unicode string for \w to match non-ASCII letters):

    m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)
    
  • 2020-11-29 06:35

    You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

    >>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
    {'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}
    

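    This is in Python 2. In Python 3 all strings are Unicode, so the u prefix is unnecessary (it was a syntax error in Python 3.0 to 3.2 and has been accepted again since 3.3), and re.UNICODE is already the default for str patterns.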
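
A minimal Python 3 sketch of the same match, assuming the pattern from the answer is used unchanged. The names PATH_RE and parse_path are hypothetical helpers added for illustration.

```python
import re

# Pattern taken from the answer above; PATH_RE is an illustrative name.
PATH_RE = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')


def parse_path(path):
    """Return the named groups as a dict, or None if the path does not match."""
    m = PATH_RE.match(path)
    return m.groupdict() if m else None


# No u prefix and no re.UNICODE flag: str patterns are Unicode-aware
# by default in Python 3.
parsed = parse_path('/by_tag/påske/øyfjell.jpg')
print(parsed)

# A space is neither a \w character nor in the bracketed set,
# so this path is rejected.
rejected = parse_path('/by_tag/påske/øy fjell.jpg')
print(rejected)
```

Compiling the pattern once is a design choice for readability here; calling re.match with the raw pattern string, as in the answer, behaves the same way.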
