In a previous answer I gave, I claimed that the following warning was caused by the fact that '\u0B95'
requires three bytes and so is a multicharacter literal.
Because you have no character encoding prefix, gcc (and any other conformant compiler) will see '\u0B95'
and treat it as 1) a char-typed literal and 2) a multicharacter literal, because there is more than one char code in the literal.
u'\u0B95' is a UTF16 character.
u'\u0B95\u0B97' is a multicharacter UTF16 character.
U'\ufacebeef' is a UTF32 character.
etc.
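To make the effect of the prefixes concrete, here is a minimal sketch (assuming a C++11 or later compiler) that checks the type each prefix produces for a single-character literal. The helper names `utf16_value` and `utf32_value` are hypothetical and exist only for this illustration; the sketch deliberately avoids the unprefixed '\u0B95', since compilers disagree on it.

```cpp
#include <cassert>
#include <type_traits>

// The prefix alone determines the type of a single-character literal.
static_assert(std::is_same<decltype('a'), char>::value, "no prefix: char");
static_assert(std::is_same<decltype(u'\u0B95'), char16_t>::value, "u prefix: char16_t");
static_assert(std::is_same<decltype(U'\u0B95'), char32_t>::value, "U prefix: char32_t");

// Hypothetical helpers: return the numeric value of the prefixed literals.
// For u'' and U'' literals holding one code point, that value is the code point.
constexpr unsigned long utf16_value() { return u'\u0B95'; }
constexpr unsigned long utf32_value() { return U'\u0B95'; }
```

Both helpers return 0x0B95, because U+0B95 fits in a single 16-bit code unit.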
I would argue as follows:
The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for
char
(for literals with no prefix)... (From section 2.14.3.4)
If '\u0B95'
falls outside of the implementation-defined range defined for char
(which it would if char
is 8 bits), its value is then implementation-defined, at which point GCC can make its value a sequence of multiple c-chars, thus becoming a multicharacter literal.
You are correct, according to the spec '\u0B95'
is a char-typed character literal with a value equal to the character's encoding in the execution character set. And you're right that the spec doesn't say anything about the case where this is not possible for char literals due to a single char being unable to represent that value. The behavior is undefined.
There are defect reports filed with the committee on this issue: E.g., http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#912
The currently proposed resolution seems to be to specify that these character literals are also ints and have implementation-defined values (although the proposed language isn't quite right for that), just like multichar literals. I'm not a fan of that solution, and I think a better solution is to say such literals are ill-formed.
This is what's implemented in clang: http://coliru.stacked-crooked.com/a/952ce7775dcf7472
Somebody posted an answer that correctly answered the second part of my question (what value will the char
have?) but has since deleted their post. Since that part was correct, I'll reproduce it here together with my answer for the first part (is it a multicharacter literal?).
'\u0B95'
is not a multicharacter literal and gcc is mistaken here. As stated in the question, a multicharacter literal is defined by (§2.14.3/1):
An ordinary character literal that contains more than one c-char is a multicharacter literal.
Since a universal-character-name is one expansion of a c-char, the literal '\u0B95'
contains only one c-char. It would make sense for \u0B95 to be treated as six separate characters (\, u, 0, etc.) if ordinary literals could not contain a universal-character-name, but I cannot find such a restriction anywhere. Therefore, it is a single character and the literal is not a multicharacter literal.
To further support this, why would it be considered to be multiple characters? At this point we haven't even given it an encoding so we don't know how many bytes it would take up. In UTF-16 it would take 2 bytes, in UTF-8 it would take 3 bytes and in some imagined encoding it could take just 1 byte.
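The encoding-dependent sizes mentioned above can be checked with string literals, whose encoding is fixed by their prefix. This is only a sketch, assuming C++11 or later; `code_units` is a hypothetical helper, and it counts code units rather than bytes (a UTF-16 code unit is typically 2 bytes, a UTF-32 code unit typically 4).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

constexpr std::size_t utf8_units  = code_units(u8"\u0B95");  // bytes E0 AE 95
constexpr std::size_t utf16_units = code_units(u"\u0B95");   // one 16-bit unit
constexpr std::size_t utf32_units = code_units(U"\u0B95");   // one 32-bit unit

static_assert(utf8_units == 3, "U+0B95 takes three UTF-8 code units");
static_assert(utf16_units == 1, "U+0B95 takes one UTF-16 code unit");
static_assert(utf32_units == 1, "U+0B95 takes one UTF-32 code unit");
```

The same universal-character-name therefore occupies a different number of code units in each encoding, which is why counting bytes cannot decide whether the literal holds more than one c-char.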
So what value will the character literal have? First, the universal-character-name is mapped to the corresponding encoding in the execution character set, unless it has no mapping, in which case it has an implementation-defined encoding (§2.14.3/5):
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
Either way, the char
literal gets the value equal to the numerical value of the encoding (§2.14.3/1):
An ordinary character literal that contains a single c-char has type
char
, with value equal to the numerical value of the encoding of the c-char in the execution character set.
Now the important part, inconveniently tucked away in a different paragraph further in the section. If the value cannot be represented in a char
, it gets an implementation-defined value (§2.14.3/4):
The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for
char
(for literals with no prefix) ...
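As a rough illustration of the "range defined for char" in that paragraph, the sketch below tests whether a numeric value fits in char on the current implementation. `fits_in_char` is a hypothetical helper, and passing the code point 0x0B95 is an assumption made for the example: the standard speaks of the value of the encoding in the execution character set, which need not equal the Unicode code point.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Hypothetical helper: true if `value` lies within the range of char
// on this implementation (the range differs for signed and unsigned char).
constexpr bool fits_in_char(long value) {
    return value >= std::numeric_limits<char>::min() &&
           value <= std::numeric_limits<char>::max();
}

// Ordinary ASCII letters always fit.
static_assert(fits_in_char('a'), "'a' is representable in char");
```

With an 8-bit char, `fits_in_char(0x0B95)` is false, which is exactly the case where the quoted paragraph makes the literal's value implementation-defined.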