How to get a reliable unicode character count in Python?

Asked 2021-01-05 01:48

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, you get back u'\U0001d10c' (length 1) when you fetch it, and the character count changes. I know I can just encode it to UTF-8 and then decode again to normalize the string before counting, but is there a more straightforward/efficient way?

2 Answers
  • 2021-01-05 02:27

    I know I can just encode it to UTF-8 and then decode again

    Yes, that's the usual idiom to fix up the problem when your input contains UTF-16 surrogate pairs inside a UCS-4 string. But as Mechanical snail said, such input is malformed, and you should preferably fix whatever produced it.
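
    For illustration, a minimal sketch of that idiom (my example, assuming a wide, UCS-4 Python 2 build, where the UTF-8 codec merges an adjacent surrogate pair into a single code point on the round trip):

    s = u'\ud834\udd0c'                        # malformed: surrogate pair, len(s) == 2
    fixed = s.encode('utf-8').decode('utf-8')  # round-trip through UTF-8
    assert fixed == u'\U0001d10c'              # now the single code point U+1D10C
    assert len(fixed) == 1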

    is there a more straightforward/efficient way?

    Well... you could do it manually with a regex, like:

    import re

    # Replace each high+low surrogate pair with the single code point it encodes.
    re.sub(
        u'([\ud800-\udbff])([\udc00-\udfff])',
        lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10)
                         + ord(m.group(2)) - 0xDC00 + 0x10000),
        s
    )
    

    Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
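
    Wrapped up for reuse (a hypothetical helper of my own, not part of the original answer; wide build assumed):

    _SURROGATE_PAIR = re.compile(u'([\ud800-\udbff])([\udc00-\udfff])')

    def merge_surrogate_pairs(s):
        # Collapse UTF-16 surrogate pairs left in a UCS-4 (wide build)
        # unicode string into the single code points they encode.
        return _SURROGATE_PAIR.sub(
            lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10)
                             + ord(m.group(2)) - 0xDC00 + 0x10000),
            s)

    assert merge_surrogate_pairs(u'\ud834\udd0c') == u'\U0001d10c'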

  • 2021-01-05 02:28

    Unfortunately, the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter. See this question for examples.

    The distinction between "narrow" and "wide" is that "narrow" interpreters internally store 16-bit code units (UCS-2), whereas "wide" interpreters internally store 32-bit code units (UCS-4). Code points at U+10000 and above (outside the Basic Multilingual Plane) have a len of two on "narrow" interpreters, because two UCS-2 code units (a surrogate pair) are needed to represent them, and that is what len measures. On "wide" builds, a non-BMP code point needs only a single UCS-4 code unit, so for those builds len is one for such code points.
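
    A minimal illustration (my addition; sys.maxunicode distinguishes the two kinds of build):

    import sys

    s = u'\U0001f44d'
    if sys.maxunicode == 0xFFFF:
        assert len(s) == 2   # narrow build: stored as the pair u'\ud83d\udc4d'
    else:
        assert len(s) == 1   # wide build: one UCS-4 code unit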

    I have confirmed that the code below handles all unicode strings, whether or not they contain surrogate pairs, and works in CPython 2.7 on both narrow and wide builds. (Arguably, specifying a string like u'\ud83d\udc4d' in a wide interpreter reflects an affirmative desire to represent a complete surrogate code point, as distinct from a partial-character code unit, and is therefore not automatically an error to be corrected; but I'm ignoring that here. It's an edge case and normally not a desired use case.)

    The @invoke trick used below is a way to avoid repeat computation without adding anything to the module's __dict__.

    invoke = lambda f: f()  # trick taken from AJAX frameworks
    
    @invoke
    def codepoint_count():
      testlength = len(u'\U00010000')  # pre-compute once
      assert (testlength == 1) or (testlength == 2)
      if testlength == 1:
        def closure(data):  # count function for "wide" interpreter
          u'returns the number of Unicode code points in a unicode string'
          return len(data.encode('UTF-16BE').decode('UTF-16BE'))
      else:
        def is_surrogate(c):
          # High (lead) surrogates: U+D800 through U+DBFF.
          ordc = ord(c)
          return 0xD800 <= ordc < 0xDC00
        def closure(data):  # count function for "narrow" interpreter
          u'returns the number of Unicode code points in a unicode string'
          return len(data) - len(filter(is_surrogate, data))
      return closure
    
    assert codepoint_count(u'hello \U0001f44d') == 7
    assert codepoint_count(u'hello \ud83d\udc4d') == 7
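
    Finally, an aside of my own: on Python 3.3 and later, PEP 393's flexible string representation removed the narrow/wide distinction entirely, so none of this machinery is needed there:

    # Python 3.3+ -- len counts code points on every build.
    assert len('hello \U0001f44d') == 7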
    