Python re: Don't split Unicode chars into surrogate pairs while matching

Question

Does anyone know whether it is possible to prevent a regex from splitting code points into surrogate pairs while matching?

See the following example:

Current behavior:

>>> import re, regex
>>> te = u'\U0001f600\U0001f600'
>>> flags1 = regex.findall(".", te, re.UNICODE)
>>> flags1
[u'\ud83d', u'\ude00', u'\ud83d', u'\ude00']

Desired behavior:

>>> te = u'\U0001f600\U0001f600'
>>> flags1 = regex.findall(".", te, re.UNICODE)
>>> flags1
[u'\U0001f600', u'\U0001f600']

The reason I need this is that I want to iterate over a Unicode string and get the next full Unicode character on each iteration.

See example:

for char in regex.findall(".", te, re.UNICODE):
    print char

Thanks in advance =)

Answer 1:

Use a regular expression that matches either a surrogate pair or any single character. This works in both wide and narrow builds of Python 2, but it isn't needed in a wide build, because a wide build doesn't store characters outside the BMP as surrogate pairs.

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print re.findall(ur'[\ud800-\udbff][\udc00-\udfff]|.', te, re.UNICODE)
[u'A', u'\u5200', u'\U0001f600', u'\U0001f601', u'\u5100', u'Z']

The same pattern still works in Python 3, but it isn't needed there either: since Python 3.3 there are no wide or narrow builds, and Unicode strings are indexed by code point, so surrogate pairs are no longer used to store characters:

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print(re.findall(r'[\ud800-\udbff][\udc00-\udfff]|.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

It also works without the surrogate alternative:

>>> print(re.findall(r'.', te))
['A', '刀', '😀', '😁', '儀', 'Z']
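
The surrogate alternative can still matter in Python 3 if a string happens to hold two separate surrogate code points instead of the real character (for instance, when it was assembled from UTF-16 code units one at a time). The following is a minimal sketch under that assumption; the helper name `join_surrogates` and the `re.DOTALL` flag are illustrative additions, not part of the answer above.

```python
import re

# Same pattern as above: a surrogate pair, or any single character.
# re.DOTALL is an addition here so that newlines are matched by "." too.
PAIR_OR_CHAR = re.compile(r'[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)

def join_surrogates(text):
    """Return a list of characters, merging each surrogate pair into one code point."""
    chars = []
    for match in PAIR_OR_CHAR.findall(text):
        if len(match) == 2:
            # Round-trip through UTF-16 so the two halves become one code point.
            match = match.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
        chars.append(match)
    return chars

# A string holding two separate surrogate code points instead of U+1F600.
broken = 'A\ud83d\ude00Z'
print(len(broken))                                          # 4
print(join_surrogates(broken) == ['A', '\U0001f600', 'Z'])  # True
```

For strings that already contain proper code points, the helper behaves like a plain `list(text)`.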

In Python 3 you can also skip the regex and simply iterate over the string:

>>> for c in te:
...     print(c)
...
A
刀
😀
😁
儀
Z
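
To see why the narrow build split the emoji, it can help to compare the number of code points with the number of UTF-16 code units for the same string. This is only an illustrative sketch for Python 3; the variable names are arbitrary.

```python
te = 'A\u5200\U0001f600\U0001f601\u5100Z'

# Python 3 strings are sequences of code points.
code_points = len(te)

# Each UTF-16 code unit is 2 bytes; non-BMP characters need two units
# (a surrogate pair), which is what a narrow Python 2 build exposed.
utf16_units = len(te.encode('utf-16-le')) // 2

print(code_points)                 # 6
print(utf16_units)                 # 8
print([hex(ord(c)) for c in te])   # ['0x41', '0x5200', '0x1f600', '0x1f601', '0x5100', '0x5a']
```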

Note that there is still an issue with graphemes (combinations of Unicode code points that represent a single user-perceived character). Here's a bad case:

>>> s = '👨🏻‍👩🏻‍👧🏻‍👦🏻'
>>> for c in s:
...     print(c)
...     
👨
🏻
‍
👩
🏻
‍
👧
🏻
‍
👦
🏻
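
The pieces printed above can be identified with the standard `unicodedata` module. This sketch assumes the family emoji is built from the code points written out below (four people, each followed by the light skin tone modifier U+1F3FB, joined by zero width joiners); the escapes are used only so the example does not depend on the editor's font.

```python
import unicodedata

# Assumed spelling of the family emoji as explicit code points.
s = ('\U0001F468\U0001F3FB\u200D\U0001F469\U0001F3FB\u200D'
     '\U0001F467\U0001F3FB\u200D\U0001F466\U0001F3FB')

print(len(s))  # 11 code points for one visible character

for c in s:
    # Print each code point with its Unicode name (fallback if unnamed).
    print('U+%04X' % ord(c), unicodedata.name(c, '<unnamed>'))
```

The invisible entries in the loop output are the zero width joiners (U+200D) that glue the individual emoji together.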

The third-party regex module can match whole graphemes with \X:

>>> import regex
>>> s = '👨🏻‍👩🏻‍👧🏻‍👦🏻'
>>> for c in regex.findall(r'\X', s):
...     print(c)
...     
👨🏻‍👩🏻‍👧🏻‍👦🏻

Source: https://stackoverflow.com/questions/51886803/python-re-dont-split-unicode-chars-into-surrogate-pairs-while-matching

Tags

python

regex

unicode

surrogate-pairs