I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) something similar: computing the UTF-16 length of a UTF-8 string without fully decoding it.
It's not an algorithm, but if I understand correctly, the rules are as follows:
- A byte starting with `0` adds 2 bytes (1 UTF-16 code unit).
- A byte starting with `110` or `1110` adds 2 bytes (1 UTF-16 code unit).
- A byte starting with `1111` adds 4 bytes (2 UTF-16 code units).
- A byte starting with `10` (a continuation byte) can be skipped.
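
Here's roughly what I mean, translated directly into scalar C. This is just a sketch assuming well-formed UTF-8 input; the function name `utf16_units_from_utf8` is my own placeholder:

```c
#include <stddef.h>
#include <stdint.h>

/* Count the UTF-16 code units needed to represent a UTF-8 string;
 * multiply by 2 for the byte count. Assumes well-formed UTF-8. */
size_t utf16_units_from_utf8(const uint8_t *s, size_t len)
{
    size_t units = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = s[i];
        if ((b & 0xC0) == 0x80)        /* 10xxxxxx: continuation, skip */
            continue;
        if ((b & 0xF8) == 0xF0)        /* 11110xxx: needs a surrogate pair */
            units += 2;
        else                           /* 0xxxxxxx, 110xxxxx, 1110xxxx */
            units += 1;
    }
    return units;
}
```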
I'm not a C expert, but this looks easily vectorizable.
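
For example, here's one possible SSE2 sketch of the idea: count every non-continuation byte once, plus one extra unit for every 4-byte leader. It assumes well-formed UTF-8 and GCC/Clang for `__builtin_popcount`; it's an illustration, not a tuned implementation:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

size_t utf16_units_sse2(const uint8_t *s, size_t len)
{
    size_t units = 0, i = 0;
    /* As signed bytes: continuations 0x80..0xBF are -128..-65,
     * and 4-byte leaders 0xF0..0xF7 are -16..-9. */
    const __m128i below_cont = _mm_set1_epi8(-65);
    const __m128i below_quad = _mm_set1_epi8(-17);
    const __m128i zero = _mm_setzero_si128();

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        /* Every non-continuation byte contributes 1 UTF-16 unit. */
        int leaders = __builtin_popcount(
            (unsigned)_mm_movemask_epi8(_mm_cmpgt_epi8(v, below_cont)));
        /* A 4-byte leader contributes 1 extra unit (surrogate pair). */
        __m128i quad = _mm_and_si128(_mm_cmpgt_epi8(v, below_quad),
                                     _mm_cmplt_epi8(v, zero));
        int quads = __builtin_popcount(
            (unsigned)_mm_movemask_epi8(quad));
        units += (size_t)(leaders + quads);
    }
    /* Scalar tail for the remaining bytes. */
    for (; i < len; i++) {
        uint8_t b = s[i];
        if ((b & 0xC0) != 0x80)
            units += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```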