Python: getting correct string length when it contains surrogate pairs

浪尽此生 提交于 2019-11-30 17:43:31

I think this has been fixen in 3.3. See:

http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)

I make a function to do this on Python 2:

SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
def unicodeLen(s):
  return len(SURROGATE_PAIR.sub('.', s))

By replacing surrogate pairs with a single character, we 'fix' the len function. On normal strings, this should be pretty efficient: since the pattern won't match, the original string will be returned without modification. It should work on wide (32-bit) Python builds, too, as the surrogate pair encoding will not be used.

schilippe

You can override the len function in Python (see: How does len work?) and add an if statement in it to check for the extra long unicode.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!