I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general-purpose approach that seems reasonable to me.
If you plan on just passing strings around and never inspecting them, you can use plain `std::string`, though that is a poor man's approach.
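For instance, a minimal sketch of that pass-through usage (the `save` helper is hypothetical, only there to illustrate treating the string as opaque bytes):

```cpp
#include <fstream>
#include <string>

// The string is treated as an opaque bag of UTF-8 bytes: never indexed,
// never sliced, only forwarded as-is, so the encoding never gets damaged.
void save(const std::string& utf8Bytes, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    out.write(utf8Bytes.data(), static_cast<std::streamsize>(utf8Bytes.size()));
}
```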
The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.
Furthermore, encoding is easy (it's a simple mapping CodePoint -> bytes and back), while the main difficulty is actually manipulating the data.
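To illustrate how mechanical the encoding step is, here is a rough sketch of the code point -> bytes direction for UTF-8 (validation of surrogates and out-of-range values omitted):

```cpp
#include <string>

// Encodes a single code point into 1 to 4 UTF-8 bytes, following the
// standard bit layout; e.g. encode_utf8(U'\u00E9') yields "\xC3\xA9".
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                                  // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                          // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                        // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                          // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```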
With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither `std::string` nor `std::wstring` is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
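A small example of the byte-level problem with `std::string`:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string s = "\xC3\xA9t\xC3\xA9";   // "été" in UTF-8: 3 characters, 5 bytes
    std::cout << s.size() << "\n";          // prints 5, not 3
    std::cout << s.substr(0, 1) << "\n";    // first byte of 'é' only: not valid UTF-8
}
```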
The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.
If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work with Unicode and multiple languages.
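A minimal sketch with ICU4C (assuming it is installed and the program is linked against it, e.g. `-licuuc`), showing that `icu::UnicodeString` reasons in code points rather than raw bytes:

```cpp
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    std::string utf8 = "caf\xC3\xA9";                               // "café" in UTF-8
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);

    std::cout << "UTF-8 bytes:       " << utf8.size()      << "\n"; // 5
    std::cout << "UTF-16 code units: " << us.length()      << "\n"; // 4
    std::cout << "code points:       " << us.countChar32() << "\n"; // 4
}
```

ICU also provides a `BreakIterator` that can iterate over grapheme clusters (a base character plus its diacritics), which addresses the splitting problem mentioned above.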