Simple UTF8->UTF16 string conversion with iconv


I want to write a function to convert a UTF8 string to UTF16 (little-endian). The problem is, the iconv function does not seem to let you know in advance how many b


2 Answers

暖寄归人

2021-01-21 09:13
That's the correct way to use iconv.

Remember that iconv is designed to recode from any character encoding to any other; it supports every combination. Given this, there are fundamentally only two ways to know how much output space you need:
1. Take a guess. Do the conversion, and enlarge the buffer as you go if the guess turns out to be too small.
2. Do the conversion twice. The first time, just count, discarding the output. Allocate the total amount of space you counted, then do the conversion again.
The first is what you are already doing. The second has the obvious disadvantage that you do the work twice. (By the way, you could do it the second way with iconv by using a scratchpad buffer in a local variable as the output buffer for the first pass.)
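
For reference, the first approach might look roughly like the sketch below. The function name `utf8_to_utf16le_grow`, the initial capacity of 16 bytes, and the doubling policy are my own choices for illustration, not anything from the question; it assumes an iconv implementation that accepts the encoding names "UTF-8" and "UTF-16LE" (glibc and GNU libiconv do).

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string to UTF-16LE.
 * Returns a malloc'd buffer (caller frees) ending in a 2-byte terminator,
 * and stores the byte length without the terminator in *out_len.
 * Returns NULL on invalid input or allocation failure. */
char *utf8_to_utf16le_grow(const char *in, size_t *out_len)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = 16;                 /* the initial guess (arbitrary) */
    char *buf = malloc(cap);
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in;          /* iconv wants a non-const char ** */
    size_t inleft = strlen(in);
    char *outp = buf;
    size_t outleft = cap;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                   /* everything converted */
        if (errno != E2BIG) {        /* EILSEQ or EINVAL: bad input */
            free(buf);
            iconv_close(cd);
            return NULL;
        }
        /* Output buffer full: double it and fix up the output pointer. */
        size_t used = (size_t)(outp - buf);
        cap *= 2;
        char *tmp = realloc(buf, cap);
        if (tmp == NULL) {
            free(buf);
            iconv_close(cd);
            return NULL;
        }
        buf = tmp;
        outp = buf + used;
        outleft = cap - used;
    }
    iconv_close(cd);

    /* Resize to the exact length plus a 2-byte UTF-16 terminator. */
    size_t used = cap - outleft;
    char *tmp = realloc(buf, used + 2);
    if (tmp == NULL) {
        free(buf);
        return NULL;
    }
    tmp[used] = 0;
    tmp[used + 1] = 0;
    *out_len = used;
    return tmp;
}
```

The important detail is that `realloc` may move the buffer, so the output pointer has to be recomputed from the number of bytes already written before calling iconv again.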

There's really no other way. Either you know in advance how many characters (not bytes) there are in the input and how many of them are/aren't in the BMP; or you don't and you have to count them.
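
If you did want to count with iconv itself (the second option, using a scratchpad buffer), a minimal sketch could look like this. The name `utf16le_size_of_utf8` and the 256-byte scratch size are assumptions of mine; the converted bytes are simply thrown away and only their total is kept.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: returns the number of bytes the UTF-16LE form of a
 * NUL-terminated UTF-8 string would occupy (terminator not included),
 * or (size_t)-1 if the input is invalid or iconv cannot be opened. */
size_t utf16le_size_of_utf8(const char *in)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char scratch[256];               /* output is discarded after each call */
    char *inp = (char *)in;
    size_t inleft = strlen(in);
    size_t total = 0;

    while (inleft > 0) {
        char *outp = scratch;        /* reuse the scratchpad from the start */
        size_t outleft = sizeof scratch;
        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        total += sizeof scratch - outleft;
        if (rc == (size_t)-1 && errno != E2BIG) {
            iconv_close(cd);         /* EILSEQ or EINVAL: bad input */
            return (size_t)-1;
        }
    }
    iconv_close(cd);
    return total;
}
```

You would then allocate `total + 2` bytes and run the real conversion, which is exactly the double work mentioned above.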

In this case you happen to know what the input and output encodings will be ahead of time. You could do a better job of guessing the amount of output buffer space you need if you do some UTF-8 gymnastics on the input string yourself before starting. This is a bit like the second option above, but more optimized because the necessary UTF-8 gymnastics are not as expensive as full-blown iconv.

Let me recommend that you don't do that, though. You'd still be making two passes on the input string so you wouldn't be saving that much, it would be a lot more code for you to write, and it introduces the possibility of a bug where the buffer could be undersized if the gymnastics aren't quite right.

I'm not even going to describe the gymnastics because what it really amounts to more or less is implementing a UTF-8 decoder, and, though the core of it is just a few simple cases of bit masking and shifting, there are details related to rejecting invalid sequences that are easy to get wrong in a way that has security implications. So don't do it.
刺人心

2021-01-21 09:15

Converting UTF-8 to UTF-16 will never more than double the size of the data. The worst case is ASCII (1 -> 2 bytes). All other BMP codepoints take 2 or 3 bytes in UTF-8 (and thus remain the same size or get smaller when converted to UTF-16). Non-BMP codepoints are exactly 4 bytes in either UTF-8 or UTF-16.

Thus, you can allocate twice the input length up front and eliminate the wasteful, complex, and error-prone realloc logic for enlarging the buffer.

By the way, make sure you leave space for the null terminator (2 bytes in UTF-16), which strlen does not count.
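
A rough sketch of what that single-allocation version could look like follows. The function name `utf8_to_utf16le_fixed` is made up for the example, and it assumes your iconv accepts "UTF-8" and "UTF-16LE" as encoding names.

```c
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string to UTF-16LE
 * using one worst-case allocation. Returns a malloc'd buffer (caller frees)
 * ending in a 2-byte terminator; *out_len gets the byte length without it.
 * Returns NULL on invalid input or allocation failure. */
char *utf8_to_utf16le_fixed(const char *in, size_t *out_len)
{
    size_t inleft = strlen(in);
    if (inleft > (SIZE_MAX - 2) / 2)
        return NULL;                 /* 2 * inleft + 2 would overflow */

    /* Worst case: every input byte is ASCII and becomes 2 bytes,
     * plus 2 bytes for the UTF-16 terminator. */
    size_t cap = 2 * inleft + 2;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) {
        free(buf);
        return NULL;
    }

    char *inp = (char *)in;          /* iconv wants a non-const char ** */
    char *outp = buf;
    size_t outleft = cap - 2;        /* keep the terminator space in reserve */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1) {          /* EILSEQ or EINVAL: bad input */
        free(buf);
        return NULL;
    }

    outp[0] = 0;                     /* 2-byte terminator */
    outp[1] = 0;
    *out_len = (size_t)(outp - buf);
    return buf;
}
```

For non-ASCII input the buffer ends up larger than needed; if that matters, a final `realloc` down to `*out_len + 2` would trim it.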
