C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

粉色の甜心 2020-12-18 03:41

I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has something like this (or would enjoy devising it).

3 Answers
  • 2020-12-18 04:17

    It's not an algorithm, but if I understand correctly the rules are as such:

    • every byte having a MSB of 0 adds 2 bytes (1 UTF-16 code unit)
      • that byte represents a single Unicode codepoint in the range U+0000 - U+007F
    • every byte having the MSBs 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
      • these bytes start 2- and 3-byte sequences respectively which represent Unicode codepoints in the range U+0080 - U+FFFF
    • every byte starting with 11110 (i.e. bytes 0xF0 and up in valid UTF-8) adds 4 bytes (2 UTF-16 code units)
      • these bytes start 4-byte sequences which cover "the rest" of the Unicode range, which can be represented with a low and high surrogate in UTF-16
    • every other byte (i.e. those starting with 10) can be skipped
      • these bytes are already counted with the others.

    I'm not a C expert, but this looks easily vectorizable.
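The rules above can be sketched directly in C. This is a minimal, non-vectorized version; it assumes valid UTF-8 input, and the function name and signature are mine, not from the answer:

```c
#include <stddef.h>

/* Count the UTF-16 code units needed for n bytes of valid UTF-8,
   following the four rules above (hypothetical helper). */
size_t utf16_units_from_utf8(const unsigned char *s, size_t n)
{
    size_t units = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = s[i];
        if (b < 0x80)        /* 0xxxxxxx: ASCII, 1 code unit */
            units += 1;
        else if (b < 0xC0)   /* 10xxxxxx: continuation, already counted */
            ;
        else if (b < 0xF0)   /* 110xxxxx / 1110xxxx: BMP, 1 code unit */
            units += 1;
        else                 /* 11110xxx: surrogate pair, 2 code units */
            units += 2;
    }
    return units;
}
```

Multiply the result by 2 to get the byte count.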

  • 2020-12-18 04:32

    Very simple: count the number of head bytes, double-counting bytes F0 and up.

    In code:

    #include <stddef.h>

    /* Length of a NUL-terminated, valid UTF-8 string in UTF-16 code units.
       (*s - 0x80U >= 0x40) counts every head byte (everything except
       10xxxxxx continuations); (*s >= 0xf0) counts 4-byte leads a second
       time, since those become surrogate pairs. */
    size_t count(unsigned char *s)
    {
        size_t l;
        for (l = 0; *s; s++) l += (*s - 0x80U >= 0x40) + (*s >= 0xf0);
        return l;
    }
    

    Note: This function returns the length in UTF-16 code units. If you want the number of bytes needed, multiply by 2. If you're going to store a null terminator you'll also need to account for space for that (one extra code unit/two extra bytes).
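A short usage sketch applying that note (2 bytes per code unit, plus one extra unit for the terminator); the allocation helper is hypothetical, not part of the answer:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* count() from the answer above: UTF-16 code units for a
   NUL-terminated, valid UTF-8 string. */
size_t count(unsigned char *s)
{
    size_t l;
    for (l = 0; *s; s++) l += (*s - 0x80U >= 0x40) + (*s >= 0xf0);
    return l;
}

/* Hypothetical helper: allocate a UTF-16 buffer big enough for the
   converted string plus a NUL terminator. */
uint16_t *alloc_utf16(unsigned char *utf8, size_t *units_out)
{
    size_t units = count(utf8);
    if (units_out)
        *units_out = units;
    return malloc((units + 1) * sizeof(uint16_t)); /* 2 bytes per unit */
}
```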

  • 2020-12-18 04:36

    Efficiency is always a speed vs. size tradeoff. If speed is favored over size, then the most efficient way is simply to guess based on the length of the source string.

    There are 4 cases that need to be considered; simply take the worst case as the final buffer size:

    • U+0000 - U+007F encodes as 1 byte in UTF-8 but 2 bytes per character in UTF-16. (1:2 = x2)
    • U+0080 - U+07FF encodes as 2-byte UTF-8 sequences and 2-byte UTF-16 characters. (2:2 = x1)
    • U+0800 - U+FFFF encodes as 3-byte UTF-8 sequences but still fits in a single 2-byte UTF-16 character. (3:2 = x0.67)
    • U+10000 - U+10FFFF encodes as 4-byte UTF-8 sequences and 4-byte surrogate pairs in UTF-16. (4:4 = x1)

    The worst-case expansion factor is for U+0000 - U+007F translated from UTF-8 to UTF-16: byte-wise, the buffer merely has to be twice as large as the source string. Every other Unicode codepoint takes the same number of bytes or fewer when encoded as UTF-16 instead of UTF-8.
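Under that reasoning the sizing is a one-liner; a sketch (the function name is mine, not from the answer):

```c
#include <stddef.h>
#include <string.h>

/* Worst-case UTF-16 buffer size, in bytes, for a NUL-terminated UTF-8
   string: 2 output bytes per input byte covers every case in the table
   above, plus 2 bytes for a UTF-16 NUL terminator. */
size_t utf16_worst_case_bytes(const char *utf8)
{
    return 2 * strlen(utf8) + 2;
}
```

The price of this speed is over-allocation: a string of 3-byte UTF-8 sequences gets a buffer roughly 3x larger than it needs.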
