Printing Unicode characters using write(2) in C

Question

I'm working on a small piece of code that prints characters to the screen. It must support all of Unicode contained in a wchar_t, and I'm limited to only write(2). I managed to print an emoji using:

write(1, "\U0001f921", 6);

So \U seems to be the way to go. However, I can't work out how to convert the wchar_t into the proper escape sequence, i.e. converting wchar_t c = L'🤡'; into \U0001f921

Can I even do that in C?

Thanks a lot.

Answer 1:

I'm working on a small piece of code that prints characters to the screen. It must support all of Unicode contained in a wchar_t, and I'm limited to only write(2).

That's a problematic combination of requirements. In particular, wchar_t character representation may very well not play nicely with using write() for output.

More generally, there are multiple issues here, among them:

The members of the source and execution character sets.
How to represent extended characters of the execution character set in your source (via the source character set).
How to present extended characters of the execution character set to the output device of your choice, so that device handles them as desired.

Note well that C specifies only a fairly small set of characters that must be present in the execution character set. Additional, "extended", characters may be present in it, and your emoji would fall into this category. Dealing with extended characters via the standard C interfaces is a bit mushy, as the standard affords implementations a great deal of freedom in how they do things there.

So \U seems to be the way to go.

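The \U introduces a "universal character name". It is important to understand that these sequences are converted to members of the execution character set during compilation. They exist only in source code, so there is no escape sequence to build at run time: a running program has to produce the encoded bytes of the character itself.

However, I can't work out how to convert the wchar_t into the proper escape sequence, i.e. converting wchar_t c = L'🤡'; into \U0001f921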

It is not safe to assume that '🤡' can be represented directly in the source character set, so as to use it literally in your source code. That depends on your C implementation. A universal character name is safer. Furthermore, if you want a wide character constant then you can try L'\U0001f921', but there's a good chance that wchar_t cannot represent that character. In particular, many implementations have 16-bit wchar_t, and those are unlikely to be able to support your character as a (single) wchar_t.

You may have better luck with a wide string literal: L"\U0001f921", but this is useful to you primarily if you are working with the wide-character-specific functions, which will perform appropriate encoding conversions for you. write() will not perform such conversions, so whether it produces the desired result will depend on the configuration of your runtime environment. I judge your original approach, with an ordinary string literal, to be more likely to work.

If you wish, and if you can use C2011 features, then you can also express a (regular) string literal that is defined to be encoded in UTF-8, regardless of what the actual execution character set is. The form for that would be u8"\U0001f921". Again, though, producing your desired result this way depends on your environment. UTF-8 literals are better suited to interacting with interfaces that are specifically defined to use UTF-8.

Can I even do that in C?

It is not safe to assume that your emoji character can be represented by a single object of type wchar_t. There may be C implementations that support it, but I think they are uncommon.

One final note: this code ...

write(1, "\U0001f921", 6);

... almost certainly exhibits undefined behavior as a result of overrunning the bounds of the char array you are presenting to write(). I don't see any plausible scenario in which it is longer than 5 characters, but you write 6, overrunning by at least 1. If the internal representation is UTF-8, then that array will have length 5 -- four bytes encoding the character (F0 9F A4 A1), and one for the string terminator.

You should measure the length to find out how many bytes to write, for example:

const char *emoji = "\U0001f921";
write(1, emoji, strlen(emoji));

Source: https://stackoverflow.com/questions/47815000/printing-unicode-characters-using-write2-in-c

Tags

unicode

wchar-t