C11 Unicode Support | 易学教程

问题

I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.

My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are an integer type of at least 16 or 32 bits respectively but the implication is that they are UTF-16 or UTF-32.

I also know that the standard defines a couple of functions but they did not include any *get or *put functions (like they did when they added in wchar.h in C99).

So I am wondering: what do they expect me to do with char16_t and char32_t?

回答1:

That's a good question with no apparent answer.

The uchar.h types and functions added in C11 are largely useless. They only support conversions between the new type (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.

Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.

回答2:

Testing if a UTF-16 or UTF-32 charter in the ASCII range is one of the "usual" 10 digits, +, - or a "normal" white-space is easy to do as well as convert '0'-'9' to a digit. Given that, atoi_utf16/32() proceeds like atoi(). Simply inspect one character at a time.

Testing if some other UTF-16/UTF-32 is a digit or white-space - that is harder. Code would need an extended isspace(), isdigit() which can be had be switching locales (setlocale()) if the needed locale is available. (Note: likely need to restore locale when the function is done.

Converting a character that passes isdigit() but is not one of the usual 10 to its value is problematic. Anyways, that appears to not even be allowed.

Conversion steps:

Set locale to a corresponding one for UTF-16/UTF-32.
Use isspace() for white-space detection.
Convert is a similar fashion for your_atof().
Restore local.

回答3:

This question may be a bit old, but I'd like to touch on implementing your functions with char16_t and char32_t support.

The easiest way to do this is to write your strtoull function using the char32_t type (call it something like strtoull_c32). This makes parsing unicode easier because every character in UTF-32 occupies four bytes. Then implement strtoull_c16 and strtoull_c8 by internally converting both UTF-8 and UTF-16 encodings to UTF-32 and passing them to strtoull_c32.

I honestly haven't looked at the Unicode facilities in the C11 standard library, but if they don't provide a suitable way for converting those types to UTF-32 then you can use a third party library to make the conversion for you.

There's ICU, which was started by IBM and then adopted by the Unicode Consortium. It's a very feature-rich and stable library that's been around for a long time.

I started a UTF library (UTFX) for C89 recently, that you could use for this too. It's pretty simple and lightweight, unit tested and documented. You could give that a go, or use it to learn more about how UTF conversions work.

来源：https://stackoverflow.com/questions/26106647/c11-unicode-support

标签

unicode

c11