Question
C++11 introduced raw string literals, which are useful for representing quoted strings and literals with many special symbols, such as Windows file paths, regular expressions, etc.
std::string path = R"(C:\teamwork\new_project\project1)"; // no tab nor newline!
std::string quoted = R"("quoted string")";
std::string expression = R"([\w]+[ ]+)";
Raw string literals can also be combined with the encoding prefixes (u8, u, U, or L). But when no encoding prefix is specified, does the file encoding matter? Suppose I have this code:
auto message = R"(Pick up a card)"; // raw string 1
auto cards = R"(🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🂫🂬🂭🂮)"; // raw string 2
If I can write and save the code above, it's obvious that my source file is encoded as Unicode, so I'm wondering:
- Would raw string 1 be a Unicode literal (even though it only uses ASCII characters)? In other words, does a raw string inherit the encoding of the file it is written in, or does the compiler detect that Unicode isn't needed regardless of the file encoding?
- Is the encoding prefix U necessary on raw string 2 for it to be treated as a Unicode literal, or would it be Unicode automatically because of its contents and/or the source file encoding?
Thanks for your attention.
EDIT:
Testing the code above on ideone.com and printing the demangled types of the message and cards variables outputs char const*:
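#include <cstdlib>   // std::free
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang only)
#include <iostream>
#include <string>
#include <typeinfo>

// Returns the human-readable name of type T.
template<typename T> std::string demangle(T t)
{
    int status = 0;
    char *const name = abi::__cxa_demangle(typeid(T).name(), 0, 0, &status);
    // Fall back to the mangled name if demangling fails (name is null then).
    std::string result(status == 0 && name ? name : typeid(T).name());
    std::free(name);
    return result;
}

int main()
{
    auto message = R"(Pick up a card)";
    auto cards = R"(🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🂫🂬🂭🂮)";
    std::cout
        << "message type: " << demangle(message) << '\n'
        << "cards type: " << demangle(cards) << '\n';
    return 0;
}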
Output:
message type: char const*
cards type: char const*
which is even weirder than I expected: I was convinced that the type would be wchar_t (even without the L prefix).
Answer 1:
Yes, it matters, even just to compile your source. If the file is saved as UTF-16, you will need something like -finput-charset=UTF-16 to compile it with gcc (the same thing should apply to VS).
But IMHO there is something more fundamental to take into account in your code. For example, std::string is a container of char, which is 1 byte large. If you are dealing with UTF-16, for instance, you need 2 bytes per code unit, so (short of a by-hand conversion) you will need at least wchar_t (std::wstring) or, to be safer in C++11, char16_t.
So, to use Unicode you will need a container for it and a compilation environment prepared to handle your Unicode-encoded sources.
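As a rough illustration of why the container matters, the sketch below counts the code units that one playing-card character (U+1F0A1) occupies in each encoding. The helper name code_units is made up for this example. The character is written as an escape so that the result does not depend on how the source file is saved, and the narrow string spells out its UTF-8 bytes by hand, which assumes you want UTF-8 in a std::string.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical helper: number of code units in a literal, excluding the
// terminating null. Works for char, char16_t, char32_t and wchar_t literals.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+1F0A1 lies outside the Basic Multilingual Plane.
static_assert(code_units(u"\U0001F0A1") == 2, "UTF-16 needs a surrogate pair");
static_assert(code_units(U"\U0001F0A1") == 1, "UTF-32 needs one code unit");
// The same character as explicit UTF-8 bytes in a plain char string.
static_assert(code_units("\xF0\x9F\x82\xA1") == 4, "UTF-8 needs four bytes");

// Each standard string type is a container of one specific code unit type.
static_assert(std::is_same<std::string::value_type, char>::value, "");
static_assert(std::is_same<std::u16string::value_type, char16_t>::value, "");
static_assert(std::is_same<std::u32string::value_type, char32_t>::value, "");
```

The size of wchar_t is not the same on every platform (commonly 2 bytes on Windows and 4 on Linux), which is one reason char16_t or char32_t may be the more predictable choice.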
Answer 2:
Raw string literals change how escapes are dealt with but do not change how encodings are handled. Raw string literals still convert their contents from the source encoding to produce a string in the appropriate execution encoding.
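A small sketch of the first point, using two hypothetical helpers (literal_length and same_text) and ASCII-only content so that the source encoding plays no role: the R prefix only switches off escape processing, while the element type stays char.

```cpp
#include <cstddef>

// Hypothetical helper: characters in a narrow literal, excluding the null.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N]) { return N - 1; }

// Hypothetical helper: compares two null-terminated strings at compile time.
constexpr bool same_text(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// In an ordinary literal "\n" is one newline character;
// in a raw literal it is a backslash followed by the letter n.
static_assert(literal_length("\n") == 1, "escape sequence is processed");
static_assert(literal_length(R"(\n)") == 2, "raw literal keeps the backslash");

// A raw literal equals the ordinary literal with every backslash doubled.
static_assert(same_text(R"(C:\teamwork\new_project)",
                        "C:\\teamwork\\new_project"),
              "same characters either way");

// The element type is unaffected by R: both are arrays of char.
static_assert(sizeof(R"(abc)"[0]) == sizeof("abc"[0]), "both are char");
```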
The type of a string literal and the appropriate execution encoding are determined entirely by the prefix. R alone always produces a char string in the narrow execution encoding. If the source is UTF-16 (and the compiler supports UTF-16 as the source encoding) then the compiler will convert the string literal contents from UTF-16 to the narrow execution encoding.
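The rule can be checked at compile time. The following is a minimal sketch, with element_type_t as a made-up alias for this example, showing that the element type follows the prefix and not the contents. The u8 prefix is left out because its element type changed from char to char8_t in C++20, and the non-ASCII case assumes a compiler whose narrow execution encoding can represent the character (UTF-8, for instance).

```cpp
#include <type_traits>

// Hypothetical alias: strips the reference, array extent and const from the
// type of a string literal, leaving only its element type.
template <typename Literal>
using element_type_t = typename std::remove_cv<
    typename std::remove_extent<
        typename std::remove_reference<Literal>::type>::type>::type;

// No encoding prefix: always char, in the narrow execution encoding.
static_assert(std::is_same<element_type_t<decltype(R"(Pick up a card)")>,
                           char>::value, "R alone gives char");

// Non-ASCII contents do not change the type; only the prefix does.
static_assert(std::is_same<element_type_t<decltype("\U0001F0A1")>,
                           char>::value, "still char without a prefix");

// Each encoding prefix selects its own element type.
static_assert(std::is_same<element_type_t<decltype(LR"(Pick up a card)")>,
                           wchar_t>::value, "L gives wchar_t");
static_assert(std::is_same<element_type_t<decltype(uR"(Pick up a card)")>,
                           char16_t>::value, "u gives char16_t");
static_assert(std::is_same<element_type_t<decltype(UR"(Pick up a card)")>,
                           char32_t>::value, "U gives char32_t");
```

This matches the char const* reported in the question: with auto, an unprefixed literal of type const char[N] decays to a pointer to const char, whatever characters it contains.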
Source: https://stackoverflow.com/questions/21460700/raw-string-literals-and-file-codification