Python re: Don't split Unicode chars into surrogate pairs while matching

Posted by 浪子不回头ぞ on 2019-12-24 19:37:32

Question


Does anyone know whether it is possible to stop a regex from splitting code points into surrogate pairs while matching?

See the following example:

Current behavior:

>>> import re, regex
>>> te = u'\U0001f600\U0001f600'
>>> flags1 = regex.findall(".", te, re.UNICODE)
>>> flags1
[u'\ud83d', u'\ude00', u'\ud83d', u'\ude00']

Desired behavior:

>>> te = u'\U0001f600\U0001f600'
>>> flags1 = regex.findall(".", te, re.UNICODE)
>>> flags1
[u'\U0001f600', u'\U0001f600']

I need this because I want to iterate over a Unicode string and get the next full Unicode character on each iteration.

See example:

for char in regex.findall(".", te, re.UNICODE):
    print char

Thanks in advance =)


Answer 1:


Use a regular expression that matches either a surrogate pair or any single character. This works in both wide and narrow builds of Python 2, although a wide build doesn't need it because it doesn't use surrogate pairs.

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print re.findall(ur'[\ud800-\udbff][\udc00-\udfff]|.', te, re.UNICODE)
[u'A', u'\u5200', u'\U0001f600', u'\U0001f601', u'\u5100', u'Z']
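
For reference, the arithmetic behind a surrogate pair can be sketched as below. This is a minimal Python 3 illustration and not part of the session above: the helper names to_surrogates and from_surrogates are made up for this example, and the "narrow" string is built by hand with chr() to imitate what a narrow Python 2 build stores.

```python
import re

# Same pattern as above: a high surrogate followed by a low surrogate,
# or else any single character.
PAIR_OR_CHAR = re.compile(r'[\ud800-\udbff][\udc00-\udfff]|.')


def to_surrogates(cp):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))               # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))   # 0x1f600

# Hand-built imitation of a narrow-build string: 'A', the pair, 'Z'.
narrow = 'A' + chr(high) + chr(low) + 'Z'
units = PAIR_OR_CHAR.findall(narrow)
print(len(narrow), len(units))           # 4 3

# The pair is kept together as one match; re-encoding with 'surrogatepass'
# turns it back into the single code point U+1F600.
joined = units[1].encode('utf-16', 'surrogatepass').decode('utf-16')
print(joined == '\U0001f600')            # True
```

The two values match the u'\ud83d' and u'\ude00' halves shown in the question.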

This still works in Python 3, but it isn't needed there either: since Python 3.3 there are no wide or narrow builds, and Unicode strings no longer use surrogate pairs:

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print(re.findall(r'[\ud800-\udbff][\udc00-\udfff]|.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

It works without the surrogate-pair alternative:

>>> print(re.findall(r'.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

And then you can just iterate normally in Python 3:

>>> for c in te:
...     print(c)
...
A
刀
😀
😁
儀
Z
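
A quick way to check this for yourself, assuming Python 3.3 or later, is to compare the string length with the number of UTF-16 code units. The sketch below only uses built-ins; the variable names are arbitrary.

```python
te = 'A\u5200\U0001f600\U0001f601\u5100Z'

# Each element of a Python 3 str is one full code point.
code_points = ['U+%04X' % ord(c) for c in te]
print(len(te))        # 6
print(code_points)    # ['U+0041', 'U+5200', 'U+1F600', 'U+1F601', 'U+5100', 'U+005A']

# Surrogate pairs only appear once the text is encoded as UTF-16:
# the two emoji take two 16-bit units each, so 6 code points become 8 units.
utf16_units = len(te.encode('utf-16-le')) // 2
print(utf16_units)    # 8
```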

Note that there is still an issue with graphemes (combinations of Unicode code points that represent a single character). Here's a bad case:

>>> s = '👨🏻‍👩🏻‍👧🏻‍👦🏻'
>>> for c in s:
...     print(c)
...     
👨
🏻
‍
👩
🏻
‍
👧
🏻
‍
👦
🏻
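
To see why the loop prints eleven lines, it can help to list the code points by name. The sketch below uses the standard unicodedata module; the string is the same family emoji, written with escapes so it does not depend on how your editor renders it.

```python
import unicodedata

# Man, woman, girl, boy: each followed by a skin-tone modifier,
# glued together with ZERO WIDTH JOINER (U+200D).
s = ('\U0001F468\U0001F3FB\u200D'
     '\U0001F469\U0001F3FB\u200D'
     '\U0001F467\U0001F3FB\u200D'
     '\U0001F466\U0001F3FB')

names = [unicodedata.name(c) for c in s]
for c, name in zip(s, names):
    print('U+%04X  %s' % (ord(c), name))

print(len(s))  # 11 code points for what is displayed as one character
```

Every other entry is either a skin-tone modifier or a joiner, which is why plain iteration tears the emoji apart.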

The regex 3rd party module can match graphemes:

>>> import regex
>>> s = '👨🏻‍👩🏻‍👧🏻‍👦🏻'
>>> for c in regex.findall(r'\X', s):
...     print(c)
...     
👨🏻‍👩🏻‍👧🏻‍👦🏻
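
If installing regex is not an option, a rough standard-library approximation is possible. To be clear, this is not what \X does and it is not the full Unicode grapheme segmentation algorithm: the function rough_clusters and the helper is_extender are hypothetical names for this sketch, and the rules only cover combining marks, skin-tone modifiers and zero-width joiners.

```python
import unicodedata

ZWJ = '\u200d'


def is_extender(ch):
    """Rough test for code points that attach to the previous one:
    combining marks (including variation selectors) and the emoji
    skin-tone modifiers U+1F3FB..U+1F3FF."""
    return (unicodedata.category(ch) in ('Mn', 'Me', 'Mc')
            or '\U0001F3FB' <= ch <= '\U0001F3FF')


def rough_clusters(text):
    """Group code points into approximate user-perceived characters."""
    clusters = []
    join_next = False  # True right after a ZWJ: glue the next code point too
    for ch in text:
        if clusters and (join_next or ch == ZWJ or is_extender(ch)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = (ch == ZWJ)
    return clusters


family = ('\U0001F468\U0001F3FB\u200D\U0001F469\U0001F3FB\u200D'
          '\U0001F467\U0001F3FB\u200D\U0001F466\U0001F3FB')

print(len(rough_clusters(family)))        # 1
print(len(rough_clusters('e\u0301x')))    # 2: 'e' + combining acute, then 'x'
```

Flags, Hangul syllable sequences and other cases are not handled here, so prefer regex and \X when correctness matters.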


Source: https://stackoverflow.com/questions/51886803/python-re-dont-split-unicode-chars-into-surrogate-pairs-while-matching
