I want to write a function to convert a UTF8 string to UTF16 (little-endian). The problem is, the iconv function does not seem to let you know in advance how many b
That's the correct way to use iconv.
Remember that iconv is designed to recode from an arbitrary character encoding to another arbitrary character encoding. It supports any combination. Given this, there are fundamentally only two ways to know how much space you need on output:
1. Make a guess, and enlarge the output buffer whenever the guess turns out to be too small.
2. Run the conversion twice: once only to measure the output size, then again into a buffer of exactly that size.
The first is what you do. The second one obviously has the disadvantage that you have to do the work twice. (By the way, you could do it the second way with iconv by using a scratchpad buffer in a local variable as the output buffer for the first pass.)
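As a rough sketch of that measuring pass (not a definitive implementation), the function below runs iconv into a small local scratch buffer and only adds up how many bytes were produced. The name utf16le_size and the 256-byte scratch size are assumptions made for illustration; the encoding names "UTF-16LE" and "UTF-8" are the ones glibc and GNU libiconv accept.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the number of bytes the UTF-16LE form of
 * `utf8` occupies (terminator not included), or (size_t)-1 on error. */
size_t utf16le_size(const char *utf8)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)utf8;      /* iconv takes char **, but only reads the input */
    size_t in_left = strlen(utf8);
    size_t total = 0;
    char scratch[256];            /* output is thrown away; only its size matters */

    while (in_left > 0) {
        char *out = scratch;
        size_t out_left = sizeof scratch;
        size_t rc = iconv(cd, &in, &in_left, &out, &out_left);

        /* Count whatever was written during this call. */
        total += sizeof scratch - out_left;

        /* E2BIG only means the scratch buffer filled up: keep going.
         * Anything else (EILSEQ, EINVAL) is invalid or truncated input. */
        if (rc == (size_t)-1 && errno != E2BIG) {
            iconv_close(cd);
            return (size_t)-1;
        }
    }

    iconv_close(cd);
    return total;
}
```

A caller would then allocate utf16le_size(s) + 2 bytes and run the real conversion into that buffer, which is exactly the "do the work twice" cost mentioned above.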
There's really no other way. Either you know in advance how many characters (not bytes) there are in the input and how many of them are/aren't in the BMP; or you don't and you have to count them.
In this case you happen to know what the input and output encodings will be ahead of time. You could do a better job of guessing the amount of output buffer space you need if you do some UTF-8 gymnastics on the input string yourself before starting. This is a bit like the second option above, but more optimized because the necessary UTF-8 gymnastics are not as expensive as full-blown iconv.
Let me recommend that you don't do that, though. You'd still be making two passes over the input string, so you wouldn't be saving that much; it would be a lot more code for you to write; and it introduces the possibility of a bug where the buffer ends up undersized if the gymnastics aren't quite right.
I'm not even going to describe the gymnastics because what it really amounts to more or less is implementing a UTF-8 decoder, and, though the core of it is just a few simple cases of bit masking and shifting, there are details related to rejecting invalid sequences that are easy to get wrong in a way that has security implications. So don't do it.
Converting UTF-8 to UTF-16 will never more than double the size of the data. The worst case is ASCII (1 -> 2 bytes). All other BMP codepoints take 2 or 3 bytes in UTF-8 and exactly 2 bytes in UTF-16, so they stay the same size or get smaller when converted. Non-BMP codepoints are exactly 4 bytes in both UTF-8 and UTF-16.
Thus, you can eliminate the wasteful, complex, and error-prone realloc logic for enlarging the buffer.
By the way, make sure you leave space for the null terminator (two bytes in UTF-16), which won't be counted by strlen.
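Putting the worst-case bound and the terminator together, a single-allocation conversion might look like the sketch below. The function name utf8_to_utf16le and its out_len parameter are hypothetical, and the code assumes an iconv that accepts the names "UTF-16LE" and "UTF-8" (as glibc and GNU libiconv do).

```c
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts NUL-terminated UTF-8 into a freshly
 * malloc'd, NUL-terminated UTF-16LE buffer. Stores the byte length
 * (terminator excluded) in *out_len. Returns NULL on failure; the
 * caller frees the result. */
char *utf8_to_utf16le(const char *utf8, size_t *out_len)
{
    size_t in_left = strlen(utf8);
    if (in_left > (SIZE_MAX - 2) / 2)
        return NULL;                      /* 2 * in_left + 2 would overflow */

    size_t cap = 2 * in_left;             /* worst case: all ASCII, 1 -> 2 bytes */
    char *buf = malloc(cap + 2);          /* +2 for the 16-bit terminator */
    if (buf == NULL)
        return NULL;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) {
        free(buf);
        return NULL;
    }

    char *in = (char *)utf8;              /* iconv only reads the input bytes */
    char *out = buf;
    size_t out_left = cap;
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {               /* invalid or truncated UTF-8 */
        free(buf);
        return NULL;
    }

    out[0] = '\0';                        /* terminator is two zero bytes */
    out[1] = '\0';
    *out_len = cap - out_left;
    return buf;
}
```

Because the buffer is sized for the worst case up front, there is no E2BIG branch and no realloc loop. The cost is that the buffer can be up to twice as large as needed for text that is mostly non-ASCII.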