Efficient way to calculate byte length of a character, depending on the encoding

予麋鹿 2021-02-06 00:48

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is known only at runtime. In UTF-8

4 Answers
  •  走了就别回头了
    2021-02-06 01:12

    If you can guarantee that the input is well-formed UTF-8, then there's no reason to decode code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte b such that (b & 0xc0) != 0x80, and you've found the start of a character. Since a UTF-8 encoded code point is at most 4 bytes long under RFC 3629 (the original design allowed up to 6), you can copy the intermediate bytes into a small fixed-length buffer.
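
    The backward scan described above can be sketched in Java as follows. This is a minimal illustration, not a definitive implementation: the class name Utf8Scan and its method names are hypothetical, and the code assumes well-formed UTF-8 as restricted by RFC 3629 (at most 4 bytes per code point). It also reads the encoded length directly from the lead byte, which avoids decoding the code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; assumes the input is well-formed UTF-8.
final class Utf8Scan {
    /** True if b is a continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Index of the first byte of the character that contains position pos. */
    static int charStart(byte[] utf8, int pos) {
        // Step backwards over continuation bytes until a lead byte is found.
        while (pos > 0 && isContinuation(utf8[pos])) {
            pos--;
        }
        return pos;
    }

    /** Encoded length in bytes, read from the lead byte of a sequence. */
    static int charLength(byte lead) {
        int b = lead & 0xFF;      // treat the byte as unsigned
        if (b < 0x80) return 1;   // 0xxxxxxx
        if (b < 0xE0) return 2;   // 110xxxxx (lead byte assumed, not 10xxxxxx)
        if (b < 0xF0) return 3;   // 1110xxxx
        return 4;                 // 11110xxx
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+20AC (3 bytes), U+1F600 (4 bytes)
        byte[] bytes = "a\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        int start = charStart(bytes, 3); // position 3 is inside U+20AC
        System.out.println(start + " " + charLength(bytes[start])); // prints "1 3"
    }
}
```

    Because the scan only inspects the top two bits of each byte, it never needs to know which code point it is looking at. Malformed input would need extra validation that this sketch leaves out.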

    Edit: I forgot to mention that, even if you don't go with this strategy, a Java "char" is not sufficient to store arbitrary code points: a "char" is 16 bits wide, while code point values can exceed 0xffff (they range up to 0x10ffff). You need to store code points in an "int".
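
    To illustrate the point about "int", here is a small sketch that keeps a code point in an int and measures its encoded length for a charset chosen at runtime. The class name CodePointBytes and the method byteLength are hypothetical. The approach allocates a String and a byte array per call, so treat it as a simple baseline rather than the most efficient option. Note also that String.getBytes substitutes a replacement byte for characters the charset cannot encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class CodePointBytes {
    /** Number of bytes needed to encode one code point in the given charset. */
    static int byteLength(int codePoint, Charset charset) {
        // Character.toChars yields one char, or a surrogate pair above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // does not fit in a 16-bit char

        // The charset name is only known at runtime; default to UTF-8 here.
        Charset charset = Charset.forName(args.length > 0 ? args[0] : "UTF-8");

        System.out.println(Character.charCount(cp));   // 2 chars (a surrogate pair)
        System.out.println(byteLength(cp, charset));   // 4 when the charset is UTF-8
        System.out.println(byteLength('A', StandardCharsets.UTF_16LE)); // 2
    }
}
```

    When reading from a String, s.codePointAt(i) returns the full code point as an int, and Character.charCount tells you how many chars to advance.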
