C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

粉色の甜心 2020-12-18 03:41

I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) an algorithm for this.

3 Answers
  •  时光说笑
    2020-12-18 04:17

    It's not an algorithm, but if I understand correctly the rules are as follows:

    • every byte having an MSB of 0 adds 2 bytes (1 UTF-16 code unit)
      • that byte represents a single Unicode codepoint in the range U+0000 - U+007F
    • every byte having the MSBs 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
      • these bytes start 2- and 3-byte sequences respectively which represent Unicode codepoints in the range U+0080 - U+FFFF
    • every byte having the four MSBs set (i.e. starting with 1111) adds 4 bytes (2 UTF-16 code units)
      • these bytes start 4-byte sequences which cover "the rest" of the Unicode range, which can be represented with a low and high surrogate in UTF-16
    • every other byte (i.e. those starting with 10) can be skipped
      • these bytes are already counted with the others.

    I'm not a C expert, but this looks easily vectorizable.
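
    The rules above can be sketched as a single pass over the UTF-8 bytes, classifying each byte by its high bits. This is a minimal illustration assuming valid UTF-8 input (no validation is done); the function name and signature are made up for the example:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Returns the number of bytes a UTF-16 encoding of the given
     * valid UTF-8 string would occupy (excluding any terminator). */
    size_t utf16_bytes_needed(const uint8_t *s, size_t len)
    {
        size_t bytes = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t b = s[i];
            if ((b & 0x80) == 0x00)       /* 0xxxxxxx: U+0000..U+007F      */
                bytes += 2;               /* one UTF-16 code unit          */
            else if ((b & 0xF0) == 0xF0)  /* 1111xxxx: 4-byte sequence     */
                bytes += 4;               /* surrogate pair (2 code units) */
            else if ((b & 0xC0) == 0xC0)  /* 110xxxxx / 1110xxxx leads     */
                bytes += 2;               /* one UTF-16 code unit          */
            /* 10xxxxxx continuation bytes add nothing */
        }
        return bytes;
    }
    ```

    Since each byte is classified independently of its neighbours, the loop has no cross-iteration dependencies beyond the running sum, which is why it vectorizes well.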
