I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has something similar for this (or would enjoy devising it).
It's not an algorithm, but if I understand correctly the rules are as follows:

- A UTF-8 byte starting with 0 (0x00-0x7F) adds 2 bytes (1 UTF-16 code unit)
- A byte starting with 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
- A byte starting with 1111 (0xF0 and up) adds 4 bytes (2 UTF-16 code units)
- A byte starting with 10 (a continuation byte) can be skipped
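Here is a literal, unoptimized rendering of those rules (my own sketch, not from the post; utf16_bytes_needed is just a name I made up). It walks the UTF-8 string and adds the UTF-16 byte count each lead byte implies, skipping continuation bytes:

#include <stddef.h>

size_t utf16_bytes_needed(const unsigned char *s)
{
    size_t bytes = 0;
    for (; *s; s++) {
        if (*s < 0x80)                    bytes += 2; /* 0xxxxxxx: 1 code unit  */
        else if (*s >= 0xC0 && *s < 0xF0) bytes += 2; /* 110xxxxx / 1110xxxx    */
        else if (*s >= 0xF0)              bytes += 4; /* 1111xxxx: 2 code units */
        /* else: 10xxxxxx continuation byte, skipped */
    }
    return bytes;
}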
I'm not a C expert, but this looks easily vectorizable.
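To back that up, here is a rough SSE2 sketch (mine, not from any answer; count_sse2 is a made-up name). It assumes the byte length n is known up front and uses GCC/Clang's __builtin_popcount. Equivalent to the rules above, it counts UTF-16 code units: one per byte outside the continuation range 0x80-0xBF, plus one extra for each byte 0xF0 and up; multiply by two for bytes.

#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

size_t count_sse2(const unsigned char *s, size_t n)
{
    size_t l = 0, i = 0;
    const __m128i bias = _mm_set1_epi8((char)0x80);

    for (; i + 16 <= n; i += 16) {
        /* XOR with 0x80 so the unsigned ranges can be tested with signed compares. */
        __m128i v = _mm_xor_si128(_mm_loadu_si128((const __m128i *)(s + i)), bias);
        /* lead/ASCII byte: original not in 0x80..0xBF  <=>  v < 0 or v > 63 */
        __m128i head = _mm_or_si128(_mm_cmpgt_epi8(_mm_setzero_si128(), v),
                                    _mm_cmpgt_epi8(v, _mm_set1_epi8(63)));
        /* 4-byte lead: original >= 0xF0  <=>  v > 111 */
        __m128i four = _mm_cmpgt_epi8(v, _mm_set1_epi8(111));
        l += (size_t)__builtin_popcount((unsigned)_mm_movemask_epi8(head))
           + (size_t)__builtin_popcount((unsigned)_mm_movemask_epi8(four));
    }
    for (; i < n; i++) /* scalar tail for the last few bytes */
        l += (s[i] - 0x80U >= 0x40) + (s[i] >= 0xf0);
    return l;
}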
Very simple: count the number of head bytes, double-counting bytes F0 and up.
In code:
#include <stddef.h>

/* Length of the UTF-8 string s in UTF-16 code units: one per head byte
   (anything outside 0x80-0xBF), plus one extra for each byte 0xF0 and up,
   since those start a 4-byte sequence that needs a surrogate pair. */
size_t count(unsigned char *s)
{
    size_t l;
    for (l=0; *s; s++) l += (*s-0x80U>=0x40) + (*s>=0xf0);
    return l;
}
Note: This function returns the length in UTF-16 code units. If you want the number of bytes needed, multiply by 2. If you're going to store a null terminator you'll also need to account for space for that (one extra code unit/two extra bytes).
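A hypothetical caller might look like this (make_utf16_buffer is a name I made up, and the conversion itself is only hinted at in a comment):

#include <stdint.h>
#include <stdlib.h>

uint16_t *make_utf16_buffer(unsigned char *src)
{
    size_t units = count(src);                          /* UTF-16 code units */
    uint16_t *dst = malloc((units + 1) * sizeof *dst);  /* +1 for the terminator */
    if (dst) {
        /* ... the actual UTF-8 -> UTF-16 conversion would fill dst[0..units-1] ... */
        dst[units] = 0;
    }
    return dst;
}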
Efficiency is always a speed vs. size tradeoff. If speed is favored over size, then the most efficient way is simply to over-allocate based on the length of the source string.
There are 4 cases to consider (1-, 2-, 3-, and 4-byte UTF-8 sequences); simply take the worst case as the final buffer size:
The worst-case expansion factor is when translating U+0000-U+007F from UTF-8 to UTF-16: bytewise, the buffer merely has to be twice as large as the source string. Every other Unicode codepoint results in an equal-size or smaller bytewise allocation when encoded as UTF-16 instead of UTF-8.
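In code, that worst case comes out to one UTF-16 code unit (2 bytes) per UTF-8 source byte, so a sketch of the over-allocation might look like this (alloc_utf16_worst_case is a made-up helper name):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

uint16_t *alloc_utf16_worst_case(const char *utf8)
{
    size_t max_units = strlen(utf8) + 1;         /* worst case: all ASCII, +1 for terminator */
    return malloc(max_units * sizeof(uint16_t)); /* 2 bytes per code unit */
}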