How to get a reliable unicode character count in Python?

Asked 2021-01-05 01:48

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, you get back u'\U0001d10c' (length 1) when you fetch it, and the character count changes. I know I can just encode it to UTF-8 and then decode again to normalize the string before counting, but is there a more straightforward/efficient way?

2 Answers
  • 2021-01-05 02:27

    I know I can just encode it to UTF-8 and then decode again

    Yes, that's the usual idiom to fix up the problem when your input contains UTF-16 surrogate pairs inside a UCS-4 string. But as Mechanical snail said, such input is malformed, and you should preferably fix whatever produced it.
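
    For illustration, a minimal sketch of that idiom (my example, assuming a wide, UCS-4 Python 2 build, where the UTF-8 codec merges an adjacent surrogate pair into a single code point on the round trip):

    s = u'\ud834\udd0c'                        # malformed: surrogate pair, len(s) == 2
    fixed = s.encode('utf-8').decode('utf-8')  # round-trip through UTF-8
    assert fixed == u'\U0001d10c'              # now the single code point U+1D10C
    assert len(fixed) == 1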

    is there a more straightforward/efficient way?

    Well... you could do it manually with a regex, like:

    import re

    # Replace each high+low surrogate pair with the single code point it encodes.
    re.sub(
        u'([\ud800-\udbff])([\udc00-\udfff])',
        lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10)
                         + ord(m.group(2)) - 0xDC00 + 0x10000),
        s
    )
    

    Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
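
    Wrapped up for reuse (a hypothetical helper of my own, not part of the original answer; wide build assumed):

    _SURROGATE_PAIR = re.compile(u'([\ud800-\udbff])([\udc00-\udfff])')

    def merge_surrogate_pairs(s):
        # Collapse UTF-16 surrogate pairs left in a UCS-4 (wide build)
        # unicode string into the single code points they encode.
        return _SURROGATE_PAIR.sub(
            lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10)
                             + ord(m.group(2)) - 0xDC00 + 0x10000),
            s)

    assert merge_surrogate_pairs(u'\ud834\udd0c') == u'\U0001d10c'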

  • 2021-01-05 02:28

    Unfortunately, the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter. See this question for examples.

    The distinction between "narrow" and "wide" is that "narrow" interpreters internally store 16-bit code units (UCS-2), whereas "wide" interpreters internally store 32-bit code units (UCS-4). Code points at U+10000 and above (outside the Basic Multilingual Plane) have a len of two on "narrow" interpreters, because two UCS-2 code units (a surrogate pair) are needed to represent them, and that is what len measures. On "wide" builds, a non-BMP code point needs only a single UCS-4 code unit, so for those builds len is one for such code points.
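
    A minimal illustration (my addition; sys.maxunicode distinguishes the two kinds of build):

    import sys

    s = u'\U0001f44d'
    if sys.maxunicode == 0xFFFF:
        assert len(s) == 2   # narrow build: stored as the pair u'\ud83d\udc4d'
    else:
        assert len(s) == 1   # wide build: one UCS-4 code unit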

    I have confirmed that the code below handles all unicode strings, whether or not they contain surrogate pairs, and works in CPython 2.7 on both narrow and wide builds. (Arguably, specifying a string like u'\ud83d\udc4d' in a wide interpreter reflects an affirmative desire to represent a complete surrogate code point, as distinct from a partial-character code unit, and is therefore not automatically an error to be corrected; but I'm ignoring that here. It's an edge case and normally not a desired use case.)

    The @invoke trick used below is a way to avoid repeat computation without adding anything to the module's __dict__.

    invoke = lambda f: f()  # trick taken from AJAX frameworks
    
    @invoke
    def codepoint_count():
      testlength = len(u'\U00010000')  # pre-compute once
      assert (testlength == 1) or (testlength == 2)
      if testlength == 1:
        def closure(data):  # count function for "wide" interpreter
          u'returns the number of Unicode code points in a unicode string'
          return len(data.encode('UTF-16BE').decode('UTF-16BE'))
      else:
        def is_surrogate(c):
          # High (lead) surrogates: U+D800 through U+DBFF.
          ordc = ord(c)
          return 0xD800 <= ordc < 0xDC00
        def closure(data):  # count function for "narrow" interpreter
          u'returns the number of Unicode code points in a unicode string'
          return len(data) - len(filter(is_surrogate, data))
      return closure
    
    assert codepoint_count(u'hello \U0001f44d') == 7
    assert codepoint_count(u'hello \ud83d\udc4d') == 7
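
    Finally, an aside of my own: on Python 3.3 and later, PEP 393's flexible string representation removed the narrow/wide distinction entirely, so none of this machinery is needed there:

    # Python 3.3+ -- len counts code points on every build.
    assert len('hello \U0001f44d') == 7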
    