Strings and character encoding in C++

轮回少年 2021-01-01 22:04

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasona

3 answers
  •  一整个雨季
    2021-01-01 22:40

    If you plan on just passing strings around and never inspecting them, plain std::string will do, though it is a poor man's solution.

    The issue is that most frameworks, and even the standard, have (stupidly, I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

    Furthermore, encoding is easy (it is a simple mapping from code points to bytes and back), while the real difficulty lies in manipulating the data.
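    To illustrate how mechanical that mapping is, here is a minimal sketch of UTF-8 encoding and decoding for a single code point. The helper names encode_utf8 and decode_utf8 are my own (they are not standard library functions), and the decoder assumes its input is well-formed UTF-8, so treat it as an illustration rather than a validating implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not code points to encode
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: decode the code point that starts at s[i] and
// advance i past it. Assumes s is well-formed UTF-8 (no validation).
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    // Number of continuation bytes implied by the lead byte.
    const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    // Keep only the payload bits of the lead byte (0x1F, 0x0F or 0x07).
    char32_t cp = extra == 0 ? static_cast<char32_t>(lead)
                             : static_cast<char32_t>(lead & (0x3F >> extra));
    for (int k = 0; k < extra; ++k) {
        assert(i < s.size());
        cp = (cp << 6) | static_cast<char32_t>(static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}
```

    With these, encode_utf8(0x20AC) yields the three bytes E2 82 AC (the euro sign), and decoding those bytes gives back 0x20AC. That is all there is to the encoding step; none of it helps with deciding where a user-visible character begins or ends.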

    With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
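    Both failure modes are easy to reproduce. The sketch below assumes the std::string holds well-formed UTF-8; truncate_at_code_point and first_code_points are hypothetical helper names of mine, not library functions. The first backs up to a code point boundary instead of cutting blindly, and the second is the naive "first n characters" on UTF-32 that still splits a letter from its combining mark.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Hypothetical helper: keep at most max_bytes bytes of a UTF-8 string
// without ending in the middle of a multi-byte sequence.
// Assumes s is well-formed UTF-8.
std::string truncate_at_code_point(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // If the first dropped byte is a continuation byte, the cut falls inside
    // a sequence, so back up until the cut sits in front of a lead byte.
    while (end > 0 && is_continuation(static_cast<unsigned char>(s[end]))) --end;
    return s.substr(0, end);
}

// Naive "first n characters" on UTF-32: one element per code point,
// but it knows nothing about combining marks.
std::u32string first_code_points(const std::u32string& s, std::size_t n) {
    return s.substr(0, n);
}
```

    For the UTF-8 bytes of "naïve" (6E 61 C3 AF 76 65), a plain substr(0, 3) keeps the dangling lead byte C3, while truncate_at_code_point(s, 3) returns just "na". For the UTF-32 sequence U+0065 U+0301 ("e" followed by a combining acute accent), first_code_points(s, 1) returns a bare "e": no code unit was cut, yet the accent is gone. Handling that second case correctly requires grapheme cluster rules, which neither helper attempts.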

    The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.

    If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.
