Python length of unicode string confusion

时光取名叫无心 2021-02-13 22:57

There's been quite a bit of help on this topic already, but I am still confused.

I have a unicode string like this:

title = u\'         


        
1 answer
  • 2021-02-13 23:25

    You have 5 code points. One of them lies outside the Basic Multilingual Plane, which means its UTF-16 encoding has to use two code units (a surrogate pair) for that single character.

    In other words, the client is relying on an implementation detail and is doing something wrong: it should be counting code points, not code units. This mistake is common on several platforms. Python 2 UCS-2 builds are one example, and Java developers often forget about the difference, as do the Windows APIs.
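To make the distinction concrete, here is a minimal sketch that assumes Python 3, where `len()` counts code points. The `sample` string and the `is_astral` helper are hypothetical stand-ins, not the original title.

```python
# Hypothetical sample: four BMP characters plus one emoji (U+1F600),
# which lies outside the Basic Multilingual Plane.
sample = "abcd\U0001F600"


def is_astral(ch):
    """Return True if the character is outside the BMP (above U+FFFF)."""
    return ord(ch) > 0xFFFF


# Characters that need a surrogate pair (two code units) in UTF-16.
astral = [ch for ch in sample if is_astral(ch)]

# Python 3 strings are sequences of code points.
code_points = len(sample)

# Each astral character costs two UTF-16 code units, every other one costs one.
utf16_units = sum(2 if is_astral(ch) else 1 for ch in sample)

print(code_points, utf16_units, len(astral))
```

With this sample, the code point count and the code unit count differ by exactly the number of characters outside the BMP.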

    You can encode your text to UTF-16 and divide the number of bytes by two (each UTF-16 code unit is 2 bytes). Pick the utf-16-le or utf-16-be variant so that a BOM is not included in the length:

    title = u'                                                                    
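A minimal sketch of the encode-and-divide approach, assuming Python 3. The `utf16_length` helper and the `sample` string are hypothetical names chosen for illustration, and the sample is not the original title.

```python
def utf16_length(text):
    """Count UTF-16 code units by encoding without a BOM and halving the byte count."""
    # utf-16-le writes no byte order mark, so every 2 bytes are one code unit.
    return len(text.encode("utf-16-le")) // 2


# Hypothetical sample: four BMP characters plus one character outside the BMP.
sample = "abcd\U0001F600"

units = utf16_length(sample)

# The plain "utf-16" codec prepends a 2-byte BOM, which inflates the count by one.
units_with_bom = len(sample.encode("utf-16")) // 2

print(len(sample), units, units_with_bom)
```

Using `utf-16-be` instead of `utf-16-le` gives the same length. Only the byte order differs.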