Python: getting correct string length when it contains surrogate pairs

太阳男子 2021-01-04 04:42

Consider the following exchange on IPython:

In [1]: s = u'華袞與緼


        
3 Answers
  • 2021-01-04 05:21

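    I think this has been fixed in Python 3.3, where PEP 393 removed the narrow/wide build distinction, so len() counts code points rather than UTF-16 code units. See: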

    http://docs.python.org/py3k/whatsnew/3.3.html
    http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
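
To illustrate the point about 3.3, here is a minimal sketch that assumes Python 3.3 or later. The sample character U+20000 and the variable names are chosen for illustration only; they do not come from the question. It compares the code point count from len() with the number of UTF-16 code units, which is what a narrow build would have reported.

```python
# Assumes Python 3.3+ (PEP 393 string representation).
s = "\U00020000"  # a single character outside the Basic Multilingual Plane

# len() counts code points on 3.3+.
code_points = len(s)

# Each UTF-16 code unit is 2 bytes, so this counts code units.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points)  # 1
print(utf16_units)  # 2
```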

  • 2021-01-04 05:26

    You can override the len function in Python (see: How does len work?) and add an if statement to it that checks for surrogate pairs, i.e. characters outside the Basic Multilingual Plane that a narrow build stores as two code units.
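
One way this suggestion might look is sketched below. The name code_point_len is hypothetical, and the code is written for Python 3 with the surrogates spelled out explicitly to imitate what a narrow Python 2 build stores; on Python 2 the isinstance check would need unicode instead of str.

```python
import builtins


def code_point_len(obj):
    """Like len(), but counts each well-formed surrogate pair once."""
    n = builtins.len(obj)
    if not isinstance(obj, str):
        return n  # non-strings behave exactly as with the built-in
    pairs = 0
    i = 0
    while i < n - 1:
        # High surrogate immediately followed by a low surrogate.
        if "\ud800" <= obj[i] <= "\udbff" and "\udc00" <= obj[i + 1] <= "\udfff":
            pairs += 1
            i += 2  # skip both halves of the pair
        else:
            i += 1
    return n - pairs


print(code_point_len("a\ud840\udc00b"))  # 3
print(code_point_len([1, 2, 3]))         # 3
```

Binding `len = code_point_len` at the top of a module would shadow the built-in for that module only; other modules keep the original behaviour.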

  • 2021-01-04 05:36

    I wrote a function to do this on Python 2:

    import re

    SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
    def unicodeLen(s):
      # Collapse each high+low surrogate pair to one placeholder character, then count.
      return len(SURROGATE_PAIR.sub(u'.', s))
    

    By replacing each surrogate pair with a single character before counting, we 'fix' the result of len. On normal strings, this should be pretty efficient: since the pattern won't match, the original string will be returned without modification. It should work on wide (UCS-4, 32-bit code unit) Python builds, too, because those builds store each character as a single code unit and do not use surrogate pairs.
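
The behaviour described above can be checked with a short sketch. It assumes Python 3, where a surrogate pair has to be written out explicitly to imitate a narrow build; unicode_len is a hypothetical Python 3 spelling of the function, and the sample strings are illustrative only.

```python
import re

# High surrogate followed by a low surrogate.
SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")


def unicode_len(s):
    # Replace each pair with one placeholder character, then count.
    return len(SURROGATE_PAIR.sub(".", s))


narrow_style = "\ud840\udc00"  # U+20000 as a narrow build would store it
wide_style = "\U00020000"      # the same character as a single code point

print(len(narrow_style), unicode_len(narrow_style))  # 2 1
print(len(wide_style), unicode_len(wide_style))      # 1 1
print(unicode_len("abc"))                            # 3
```

In the wide-style case the pattern never matches, so the result is the same as plain len.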
