I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general-purpose approach that seems reasonable to me.
If you plan on just passing strings around and never inspecting them, you can use plain `std::string`, though that is a poor man's approach.
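For instance, a minimal sketch of that pass-through usage (the `save` helper is hypothetical, only there to illustrate treating the string as opaque bytes):

```cpp
#include <fstream>
#include <string>

// The string is treated as an opaque bag of UTF-8 bytes: never indexed,
// never sliced, only forwarded as-is, so the encoding never gets damaged.
void save(const std::string& utf8Bytes, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    out.write(utf8Bytes.data(), static_cast<std::streamsize>(utf8Bytes.size()));
}
```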
The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.
Furthermore, encoding is easy (it's a simple mapping CodePoint -> bytes and back), while the main difficulty is actually manipulating the data.
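To illustrate how mechanical the encoding step is, here is a rough sketch of the code point -> bytes direction for UTF-8 (validation of surrogates and out-of-range values omitted):

```cpp
#include <string>

// Encodes a single code point into 1 to 4 UTF-8 bytes, following the
// standard bit layout; e.g. encode_utf8(U'\u00E9') yields "\xC3\xA9".
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                                  // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                          // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                        // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                          // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```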
With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither `std::string` nor `std::wstring` is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
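A small example of the byte-level problem with `std::string`:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string s = "\xC3\xA9t\xC3\xA9";   // "été" in UTF-8: 3 characters, 5 bytes
    std::cout << s.size() << "\n";          // prints 5, not 3
    std::cout << s.substr(0, 1) << "\n";    // first byte of 'é' only: not valid UTF-8
}
```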
The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.
If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work with Unicode and multiple languages.
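A minimal sketch with ICU4C (assuming it is installed and the program is linked against it, e.g. `-licuuc`), showing that `icu::UnicodeString` reasons in code points rather than raw bytes:

```cpp
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    std::string utf8 = "caf\xC3\xA9";                               // "café" in UTF-8
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);

    std::cout << "UTF-8 bytes:       " << utf8.size()      << "\n"; // 5
    std::cout << "UTF-16 code units: " << us.length()      << "\n"; // 4
    std::cout << "code points:       " << us.countChar32() << "\n"; // 4
}
```

ICU also provides a `BreakIterator` that can iterate over grapheme clusters (a base character plus its diacritics), which addresses the splitting problem mentioned above.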