Simple UTF8->UTF16 string conversion with iconv

问题

I want to write a function to convert a UTF8 string to UTF16 (little-endian). The problem is, the iconv function does not seem to let you know in advance how many bytes you'll need to store the output string.

My solution is to start by allocating 2*strlen(utf8), and then run iconv in a loop, increasing the size of that buffer with realloc if necessary:

static int utf8_to_utf16le(char *utf8, char **utf16, int *utf16_len)
{
    iconv_t cd;
    char *inbuf, *outbuf;
    size_t inbytesleft, outbytesleft, nchars, utf16_buf_len;

    cd = iconv_open("UTF16LE", "UTF8");
    if (cd == (iconv_t)-1) {
        printf("!%s: iconv_open failed: %d\n", __func__, errno);
        return -1;
    }

    inbytesleft = strlen(utf8);
    if (inbytesleft == 0) {
        printf("!%s: empty string\n", __func__);
        iconv_close(cd);
        return -1;
    }
    inbuf = utf8;
    utf16_buf_len = 2 * inbytesleft;            // sufficient in many cases, i.e. if the input string is ASCII
    *utf16 = malloc(utf16_buf_len);
    if (!*utf16) {
        printf("!%s: malloc failed\n", __func__);
        iconv_close(cd);
        return -1;
    }
    outbytesleft = utf16_buf_len;
    outbuf = *utf16;

    nchars = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    while (nchars == (size_t)-1 && errno == E2BIG) {
        char *ptr;
        size_t increase = 10;                   // increase length a bit
        size_t len;
        utf16_buf_len += increase;
        outbytesleft += increase;
        ptr = realloc(*utf16, utf16_buf_len);
        if (!ptr) {
            printf("!%s: realloc failed\n", __func__);
            free(*utf16);
            iconv_close(cd);
            return -1;
        }
        len = outbuf - *utf16;
        *utf16 = ptr;
        outbuf = *utf16 + len;
        nchars = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    }
    if (nchars == (size_t)-1) {
        printf("!%s: iconv failed: %d\n", __func__, errno);
        free(*utf16);
        iconv_close(cd);
        return -1;
    }

    iconv_close(cd);
    *utf16_len = utf16_buf_len - outbytesleft;

    return 0;
}

Is this really the best way to do it? Repeated reallocs seems wasteful, but without knowing what character sequences could be in the utf8, and what they would result in in utf16, I don't know if I can make a better guess for the initial buffer size than 2*strlen(utf8).

回答1:

That's the correct way to use iconv.

Remember that iconv is designed to be able to recode from an arbitrary character encoding to another arbitrary character encoding. It supports any combination. Given this, there are fundamentally really only 2 ways to know how much space you need on output:

Take a guess. Do the conversion, and increase your guess as you go if necessary.
Do the conversion twice. The first time, just count, discarding output. Allocate the total amount of space you counted, then do the conversion again.

The first is what you do. The second one obviously has the disadvantage that you have to do the work twice. (By the way, you could do it the second way with iconv by using a scratchpad buffer in a local variable as the output buffer for the first pass.)

There's really no other way. Either you know in advance how many characters (not bytes) there are in the input and how many of them are/aren't in the BMP; or you don't and you have to count them.

In this case you happen to know what the input and output encodings will be ahead of time. You could do a better job of guessing the amount of output buffer space you need if you do some UTF-8 gymnastics on the input string yourself before starting. This is a bit like the second option above, but more optimized because the necessary UTF-8 gymnastics are not as expensive as full-blown iconv.

Let me recommend that you don't do that, though. You'd still be making two passes on the input string so you wouldn't be saving that much, it would be a lot more code for you to write, and it introduces the possibility of a bug where the buffer could be undersized if the gymnastics aren't quite right.

I'm not even going to describe the gymnastics because what it really amounts to more or less is implementing a UTF-8 decoder, and, though the core of it is just a few simple cases of bit masking and shifting, there are details related to rejecting invalid sequences that are easy to get wrong in a way that has security implications. So don't do it.

回答2:

Converting UTF-8 to UTF-16 will never more than double the size of the data. Worst-case is ASCII (1->2 bytes). All other BMP codepoints in UTF-8 take 2 or 3 bytes (and thus remain the same size or get smaller when converted to UTF-16. Non-BMP codepoints are exactly 4 bytes in either UTF-8 or UTF-16.

Thus, you can eliminate the wasteful, complex, and error-prone realloc logic for enlarging the buffer.

By the way, make sure you leave space for null termination which won't be counted by strlen.

来源：https://stackoverflow.com/questions/13297458/simple-utf8-utf16-string-conversion-with-iconv

标签

string

utf-8

posix

iconv