How to convert from utf-16 to utf-32 on Linux with std library?

感情败类 2021-01-14 14:17

On MSVC, converting UTF-16 to UTF-32 is easy with C++11's codecvt_utf16 locale facet. But in GCC (gcc (Debian 4.7.2-5) 4.7.2) seemingly this new feature has

1 answer
  • 2021-01-14 14:33

    Decoding UTF-16 into UTF-32 is extremely easy.

    You may want to detect at compile time which libc version you're using, and fall back to your own conversion routine if you detect a broken libc (one without the functions you need).
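One way to do that check on glibc is sketched below. It assumes that `<uchar.h>` (the C11 header with the `char16_t`/`char32_t` conversion functions) first shipped in glibc 2.16; verify that threshold against your target before relying on it. `HAVE_LIBC_UCHAR` and `utf16_backend` are hypothetical names used only for illustration.

```c
#include <stdlib.h>  /* any glibc header pulls in <features.h>, which defines __GLIBC__ */

/* Assumption: glibc 2.16 is the first release that ships <uchar.h>. */
#if defined(__GLIBC__) && \
    (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 16))
#  define HAVE_LIBC_UCHAR 1
#else
#  define HAVE_LIBC_UCHAR 0
#endif

/* Hypothetical helper: reports which implementation was selected. */
static const char *utf16_backend(void)
{
#if HAVE_LIBC_UCHAR
    return "libc";
#else
    return "handwritten";
#endif
}
```

On other C libraries the macro evaluates to 0, so the handwritten routine is selected by default.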

    Inputs:

    • a pointer to the source UTF-16 data (char16_t *, ushort * -- for convenience, UTF16 *);
    • its size, in UTF-16 code units;
    • a pointer to the UTF-32 data (char32_t *, uint * -- for convenience, UTF32 *).
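The list has no output size, so the caller must guarantee that the destination is large enough. A minimal sketch of how that could be computed, assuming `UTF16` and `UTF32` are typedefs for `uint16_t` and `uint32_t` (the answer only names them) and that every unpaired surrogate produces exactly one output unit; `utf32_length` is a hypothetical helper:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;  /* assumed: one UTF-16 code unit */
typedef uint32_t UTF32;  /* assumed: one UTF-32 code unit */

/* Number of UTF-32 units needed for `size` UTF-16 units.
   A valid surrogate pair counts once; every other unit counts once too. */
static size_t utf32_length(const UTF16 *input, size_t size)
{
    size_t count = 0;
    size_t i = 0;
    while (i < size) {
        const UTF16 uc = input[i++];
        /* A high surrogate followed by a low surrogate is consumed as a pair. */
        if ((uc & 0xfc00u) == 0xd800u && i < size &&
            (input[i] & 0xfc00u) == 0xdc00u)
            i++;
        count++;
    }
    return count;
}
```

The result never exceeds `size`, so allocating `size` UTF-32 units is always a safe upper bound if you would rather skip the extra pass.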

    Code looks like:

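    void convert_utf16_to_utf32(const UTF16 *input,
                                size_t input_size,
                                UTF32 *output)
    {
        const UTF16 * const end = input + input_size;
        while (input < end) {
            const UTF16 uc = *input++;
            if (!is_surrogate(uc)) {
                // BMP code point: copy it through unchanged.
                *output++ = uc;
            } else {
                // A high surrogate must be followed by a low surrogate.
                if (is_high_surrogate(uc) && input < end && is_low_surrogate(*input))
                    *output++ = surrogate_to_utf32(uc, *input++);
                else
                    // ERROR: unpaired surrogate. Placeholder policy: emit U+FFFD;
                    // bail out here instead if you prefer.
                    *output++ = 0xFFFD;
            }
        }
    }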
    

    Error handling is left to you: you might insert a U+FFFD¹ into the stream and keep going, or just bail out. The auxiliary functions are trivial:

    int is_surrogate(UTF16 uc) { return (uc - 0xd800u) < 2048u; }
    int is_high_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xd800; }
    int is_low_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xdc00; }
    
    UTF32 surrogate_to_utf32(UTF16 high, UTF16 low) { 
        return (high << 10) + low - 0x35fdc00; 
    }
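The constant `0x35fdc00` in `surrogate_to_utf32` folds three terms of the usual decoding formula into one. The sketch below spells out the long form next to the short one so the two can be compared; `decode_pair_long` and `decode_pair_short` are names made up for this illustration.

```c
#include <stdint.h>

/* Long form: strip the surrogate bases, then add the 0x10000 offset. */
static uint32_t decode_pair_long(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)high - 0xd800u) << 10)
                    + ((uint32_t)low - 0xdc00u);
}

/* Short form from the answer:
   (0xd800 << 10) + 0xdc00 - 0x10000 == 0x35fdc00. */
static uint32_t decode_pair_short(uint16_t high, uint16_t low)
{
    return ((uint32_t)high << 10) + low - 0x35fdc00u;
}
```

For example, the pair `0xd83d 0xde00` decodes to U+1F600 with either form.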
    

    ¹ Cf. Unicode:

    • § 3.9 Unicode Encoding Forms (Best Practices for Using U+FFFD)
    • § 5.22 Best Practice for U+FFFD Substitution
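The two error policies (substitute U+FFFD and continue, or bail out) can be sketched in one routine. This is an illustration under assumptions, not the answer's code: `UTF16`/`UTF32` are taken to be `uint16_t`/`uint32_t`, and the `utf16_error_policy` enum, the `convert_utf16_checked` name, and the `-1` return convention are all hypothetical. One U+FFFD is emitted per unpaired surrogate.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;  /* assumed typedef */
typedef uint32_t UTF32;  /* assumed typedef */

/* Hypothetical policy switch for ill-formed input. */
enum utf16_error_policy { UTF16_REPLACE, UTF16_STOP };

/* Returns the number of UTF-32 units written, or -1 if the policy is
   UTF16_STOP and an unpaired surrogate was found. `output` must have
   room for `size` units. */
static ptrdiff_t convert_utf16_checked(const UTF16 *input, size_t size,
                                       UTF32 *output,
                                       enum utf16_error_policy policy)
{
    const UTF16 *const end = input + size;
    UTF32 *out = output;
    while (input < end) {
        const UTF32 uc = *input++;
        if (uc - 0xd800u >= 2048u) {
            *out++ = uc;                     /* not a surrogate */
        } else if ((uc & 0xfc00u) == 0xd800u && input < end &&
                   (*input & 0xfc00u) == 0xdc00u) {
            /* valid pair: same arithmetic as surrogate_to_utf32 */
            *out++ = (uc << 10) + *input++ - 0x35fdc00u;
        } else if (policy == UTF16_REPLACE) {
            *out++ = 0xfffdu;                /* U+FFFD REPLACEMENT CHARACTER */
        } else {
            return -1;                       /* bail out */
        }
    }
    return out - output;
}
```

With `UTF16_STOP`, whatever was written before the error stays in `output`; a caller that needs all-or-nothing behaviour would have to discard it.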

    ² Also consider that the !is_surrogate(uc) branch is by far the most common (as well as the non-error path in the second if); you might want to optimize that with __builtin_expect or similar.
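A minimal sketch of such a hint, assuming GCC or Clang (both provide `__builtin_expect`). The `LIKELY`/`UNLIKELY` macros and `count_surrogate_units` are hypothetical names, and whether the hint actually helps is something to measure rather than assume.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)  /* also defined by Clang */
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else                  /* other compilers: plain condition, no hint */
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Counts surrogate code units, telling the compiler they are rare. */
static size_t count_surrogate_units(const uint16_t *input, size_t size)
{
    size_t n = 0;
    for (size_t i = 0; i < size; i++) {
        if (UNLIKELY((uint32_t)input[i] - 0xd800u < 2048u))
            n++;
    }
    return n;
}
```

In the conversion loop the same idea would read `if (LIKELY(!is_surrogate(uc)))`.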
