Python: Find equivalent surrogate pair from non-BMP unicode char

半腔热情 提交于 2019-11-27 14:29:06

You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:

import re

_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) + 
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'

It's a little complex, but here's a one-liner to convert a single character:

>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'

To convert a mix of characters requires surrounding that expression with another:

>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!