Question
I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.
My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are an integer type of at least 16 or 32 bits respectively but the implication is that they are UTF-16 or UTF-32.
I also know that the standard defines a couple of functions, but it did not include any *get or *put functions (as it did when wchar.h was added in C99).
So I am wondering: what do they expect me to do with char16_t and char32_t?
Answer 1:
That's a good question with no apparent answer.
The uchar.h types and functions added in C11 are largely useless. They only support conversions between the new types (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, and those mappings are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own conversions to/from UTF-8, since these are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.
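To make the "be careful" part concrete, here is a minimal sketch of a strict UTF-8 decoder for a single code point. The name utf8_decode and its signature are my own invention, not anything from uchar.h. It rejects the cases that hand-rolled decoders most often let through: overlong forms, encoded surrogates, values above U+10FFFF, and truncated sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: decodes one code point from s[0..n).
 * Returns the number of bytes consumed (1 to 4), or 0 if the input
 * is empty, truncated, or not well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t n, char32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t len;
    char32_t cp, min;

    if (b0 < 0x80) {                 /* plain ASCII */
        *out = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        len = 2; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        len = 4; cp = b0 & 0x07; min = 0x10000;
    } else {
        return 0;                    /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                    /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;                    /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                    /* surrogates are not valid in UTF-8 */
    if (cp > 0x10FFFF)
        return 0;                    /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

Returning 0 for every error keeps the sketch short; a real API would probably want to distinguish "truncated" from "malformed" so a streaming caller can wait for more input.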
Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
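As a rough illustration of processing such literals with your own locale-independent functions, the sketch below measures a u"" string in code units and in code points. The helper names c16_len and c16_codepoints are made up for this example, and it assumes the implementation really encodes u"" literals as UTF-16 (that is, it defines __STDC_UTF_16__, as GCC and Clang do).

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: number of char16_t code units before the
 * terminating 0, i.e. strlen() for char16_t strings. */
size_t c16_len(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: number of code points, counting a well-formed
 * surrogate pair as one. An unpaired surrogate is counted as one
 * (invalid) code point rather than rejected. */
size_t c16_codepoints(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != 0) {
        /* s[1] is safe to read: at worst it is the terminator. */
        if (*s >= 0xD800 && *s <= 0xDBFF && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;   /* high surrogate followed by low surrogate */
        else
            s += 1;
        n++;
    }
    return n;
}
```

For example, u"a\U0001F600b" is four code units but three code points, and neither function touches setlocale() or any other locale state.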
Answer 2:
Testing whether a UTF-16 or UTF-32 character in the ASCII range is one of the "usual" 10 digits, '+', '-', or a "normal" white-space character is easy, as is converting '0'-'9' to a digit value. Given that, atoi_utf16/32() proceeds like atoi(): simply inspect one character at a time.
Testing whether some other UTF-16/UTF-32 character is a digit or white-space is harder. Code would need an extended isspace() and isdigit(), which can be had by switching locales (setlocale()) if the needed locale is available. (Note: you will likely need to restore the locale when the function is done.)
Converting a character that passes isdigit() but is not one of the usual 10 to its value is problematic. Anyway, that appears not even to be allowed.
Conversion steps:
1. Set the locale to a corresponding one for UTF-16/UTF-32.
2. Use isspace() for white-space detection.
3. Convert in a similar fashion for your_atof().
4. Restore the locale.
Answer 3:
This question may be a bit old, but I'd like to touch on implementing your functions with char16_t and char32_t support.
The easiest way to do this is to write your strtoull function using the char32_t type (call it something like strtoull_c32). This makes parsing Unicode easier because every character in UTF-32 occupies exactly four bytes, so there are no multi-unit sequences to deal with. Then implement strtoull_c16 and strtoull_c8 by internally converting the UTF-16 and UTF-8 input to UTF-32 and passing it to strtoull_c32.
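Here is a rough sketch of that layering, with several simplifications that are mine rather than part of the suggestion above: it is base 10 only, it does not accept a '-' sign (the real strtoull() does), it saturates at ULLONG_MAX without setting errno, and strtoull_c16 converts into a fixed 64-element buffer and returns no end pointer. The helper utf16_to_utf32 is a hypothetical name too.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <uchar.h>

/* Base-10 parser over UTF-32. If end is non-NULL it receives the first
 * unparsed position, or s itself when no digits were found. */
unsigned long long strtoull_c32(const char32_t *s, const char32_t **end)
{
    assert(s != NULL);
    const char32_t *p = s;

    while (*p == U' ' || (*p >= U'\t' && *p <= U'\r'))
        p++;                                   /* ASCII white-space only */
    if (*p == U'+')
        p++;

    unsigned long long value = 0;
    int any = 0;
    while (*p >= U'0' && *p <= U'9') {
        unsigned digit = (unsigned)(*p - U'0');
        if (value > (ULLONG_MAX - digit) / 10)
            value = ULLONG_MAX;                /* saturate on overflow */
        else
            value = value * 10 + digit;
        any = 1;
        p++;
    }

    if (end != NULL)
        *end = any ? p : s;
    return value;
}

/* Converts at most cap-1 code points and always 0-terminates dst.
 * Unpaired surrogates are replaced with U+FFFD. Returns the count. */
size_t utf16_to_utf32(const char16_t *src, char32_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL && cap > 0);
    size_t n = 0;
    while (*src != 0 && n + 1 < cap) {
        char32_t c = *src++;
        if (c >= 0xD800 && c <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF)
            c = 0x10000 + ((c - 0xD800) << 10) + (*src++ - 0xDC00);
        else if (c >= 0xD800 && c <= 0xDFFF)
            c = 0xFFFD;                        /* lone surrogate */
        dst[n++] = c;
    }
    dst[n] = 0;
    return n;
}

/* UTF-16 front end: convert a bounded prefix, then reuse the parser. */
unsigned long long strtoull_c16(const char16_t *s)
{
    char32_t buf[64];                          /* enough for 20 digits plus padding */
    utf16_to_utf32(s, buf, sizeof buf / sizeof buf[0]);
    return strtoull_c32(buf, NULL);
}
```

The fixed buffer means input with more than 63 code points of leading white-space would be cut off before any digit is seen. Decoding on the fly inside the parsing loop avoids that, at the cost of not sharing strtoull_c32 unchanged.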
I honestly haven't looked at the Unicode facilities in the C11 standard library, but if they don't provide a suitable way of converting those types to UTF-32, then you can use a third-party library to make the conversion for you.
There's ICU, which was started by IBM and then adopted by the Unicode Consortium. It's a very feature-rich and stable library that's been around for a long time.
I started a UTF library (UTFX) for C89 recently, that you could use for this too. It's pretty simple and lightweight, unit tested and documented. You could give that a go, or use it to learn more about how UTF conversions work.
Source: https://stackoverflow.com/questions/26106647/c11-unicode-support