Simple UTF8->UTF16 string conversion with iconv

庸人自扰 2021-01-21 08:28

I want to write a function to convert a UTF8 string to UTF16 (little-endian). The problem is, the iconv function does not seem to let you know in advance how many b

2 answers
  •  暖寄归人
    2021-01-21 09:13

    That's the correct way to use iconv.

    Remember that iconv is designed to recode from any character encoding to any other; it supports every combination. Given this, there are fundamentally only two ways to know how much space you need for the output:

    1. Take a guess. Do the conversion, and increase your guess as you go if necessary.
    2. Do the conversion twice. The first time, just count, discarding output. Allocate the total amount of space you counted, then do the conversion again.
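
    As a rough illustration of the first option, here is a minimal sketch in C. The helper name utf8_to_utf16le, the initial guess of in_len + 16 bytes, and the doubling growth policy are all assumptions made for this example, not something taken from the question. It relies on iconv() failing with E2BIG when the output buffer fills up, at which point the buffer is enlarged and the call is repeated from where it stopped.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts in[0..in_len) from UTF-8 to UTF-16LE.
 * Returns a malloc'd buffer (the caller frees it) and stores its length
 * in bytes in *out_len. Returns NULL on any error. */
char *utf8_to_utf16le(const char *in, size_t in_len, size_t *out_len)
{
    assert(in != NULL && out_len != NULL);

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = in_len + 16;        /* initial guess, deliberately modest */
    char *buf = malloc(cap);
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;          /* iconv's prototype is not const-correct */
    size_t src_left = in_len;
    char *dst = buf;
    size_t dst_left = cap;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;                   /* all input consumed */

        if (errno != E2BIG) {        /* EILSEQ or EINVAL: bad or truncated input */
            free(buf);
            iconv_close(cd);
            return NULL;
        }

        /* Output buffer is full: raise the guess and carry on. */
        size_t used = (size_t)(dst - buf);
        cap *= 2;
        char *bigger = realloc(buf, cap);
        if (bigger == NULL) {
            free(buf);
            iconv_close(cd);
            return NULL;
        }
        buf = bigger;
        dst = buf + used;            /* realloc may have moved the buffer */
        dst_left = cap - used;
    }

    iconv_close(cd);
    *out_len = (size_t)(dst - buf);
    return buf;
}
```

    The important detail is recomputing dst after realloc, since iconv advances the output pointer and the buffer may have moved.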

    The first is what you are already doing. The second has the obvious disadvantage that the work is done twice. (By the way, you can do it the second way with iconv by using a scratchpad buffer in a local variable as the output buffer for the first pass.)
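
    The counting pass with a scratchpad buffer might look roughly like the sketch below. The function name utf16le_size and the 256-byte scratch size are assumptions chosen for illustration. Each call to iconv() writes into the same small local buffer, and only the number of bytes produced is kept.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes that in[0..in_len),
 * taken as UTF-8, would occupy as UTF-16LE, or (size_t)-1 on error.
 * The converted output itself is discarded. */
size_t utf16le_size(const char *in, size_t in_len)
{
    assert(in != NULL);

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char scratch[256];               /* reused for every chunk */
    char *src = (char *)in;          /* iconv's prototype is not const-correct */
    size_t src_left = in_len;
    size_t total = 0;

    while (src_left > 0) {
        char *dst = scratch;         /* rewind the scratchpad each round */
        size_t dst_left = sizeof scratch;

        size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        total += sizeof scratch - dst_left;

        if (rc != (size_t)-1)
            break;                   /* all input consumed */
        if (errno != E2BIG) {        /* EILSEQ or EINVAL: bad or truncated input */
            total = (size_t)-1;
            break;
        }
        /* E2BIG only means the scratchpad filled up; go around again. */
    }

    iconv_close(cd);
    return total;
}
```

    A caller would allocate exactly the returned number of bytes and then run the real conversion into that buffer.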

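    There's really no other way. Either you know in advance how many characters (not bytes) the input contains and how many of them fall outside the BMP, or you don't and you have to count them. In UTF-16, each BMP character occupies 2 bytes and each character outside the BMP occupies 4 (a surrogate pair), so those two counts determine the output size exactly.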

    In this case you happen to know what the input and output encodings will be ahead of time. You could do a better job of guessing the amount of output buffer space you need if you do some UTF-8 gymnastics on the input string yourself before starting. This is a bit like the second option above, but more optimized because the necessary UTF-8 gymnastics are not as expensive as full-blown iconv.

    Let me recommend that you don't do that, though. You would still be making two passes over the input string, so you wouldn't save much. It would also be a lot more code to write, and it introduces the possibility of a bug where the buffer is undersized because the gymnastics aren't quite right.

    I'm not even going to describe the gymnastics, because what they amount to is more or less implementing a UTF-8 decoder. The core of that is just a few simple cases of bit masking and shifting, but the details of rejecting invalid sequences are easy to get wrong in ways that have security implications. So don't do it.
