std::string and UTF-8 encoded Unicode


耶瑟儿～

If I understand correctly, it is possible to use both std::string and std::wstring to store UTF-8 text.

With char, ASCII characters take a single byte, some chinese charac


3 Answers

没有蜡笔的小新

2021-01-05 11:30

You can store UTF-8 in std::string, but neither std::string nor the rest of the Standard Library gives you Unicode-aware operations such as iterating over code points. Use an external library such as: http://utfcpp.sourceforge.net/

时光取名叫无心

2021-01-05 11:33

You are correct about these points:
...Which means that str[3] doesn't necessarily point to the 4th character...only use them as dummy feature-less byte arrays...

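std::string in C++ is not Unicode-aware: it treats its contents as a sequence of char values, and a char in C/C++ is just a byte. This is different from Java's String, which is defined in terms of Unicode (UTF-16 code units). You can store the encoded bytes of Chinese characters in a std::string, but its member functions (size(), operator[], substr() and so on) count and index bytes, not characters, so you cannot rely on them for character-level processing.
std::wstring may be what you need, but note that wchar_t is 16 bits wide on Windows and 32 bits wide on most Unix-like systems, so it is not a portable answer on its own.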

One thing should be clarified: UTF-8 is just an encoding of Unicode characters, that is, a mapping between code points and byte sequences.
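To make the point about str[3] concrete, here is a minimal sketch of counting and locating characters in a std::string that is assumed to hold valid UTF-8. The helper names utf8_length and utf8_offset are made up for this illustration; they are not part of the Standard Library or of utfcpp, and they do no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to contain valid UTF-8. Continuation bytes look like 10xxxxxx,
// so every byte that does not match that pattern starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: returns the byte offset at which the code point
// with the given zero-based index starts, or std::string::npos if the
// string contains fewer code points than that.
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == index) {
                return i;
            }
            ++seen;
        }
    }
    return std::string::npos;
}
```

For the text "a你好b" stored as UTF-8 (the bytes 61 E4 BD A0 E5 A5 BD 62), size() is 8 while utf8_length returns 4. s[3] is the last byte of 你, not the 4th character; the 4th character, b, starts at byte offset 7, which is what utf8_offset(s, 3) returns.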

甜味超标

2021-01-05 11:55

You are talking about Unicode. Unicode assigns every character a code point in the range U+0000 to U+10FFFF, which fits in 21 bits; the fixed-width encoding UTF-32 stores each code point in 32 bits. Since that wastes memory, there are more compact encodings. UTF-8 is one such encoding. It works in byte units and maps each code point to 1, 2, 3 or 4 bytes. UTF-16 is another; it works in 16-bit units and maps each code point to 1 or 2 units (2 or 4 bytes). You can use either encoding with either string or wstring. UTF-8 tends to be more compact for English text and numbers.
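The 1-to-4-byte mapping can be sketched in a few lines. This is only an illustration of the UTF-8 bit layout, not a replacement for a library; the function name encode_utf8 is an assumption made for this example, and returning an empty string for invalid input is just one possible error convention.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for values that cannot be encoded
// (surrogates U+D800..U+DFFF and anything above U+10FFFF).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not valid scalar values
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, U+0041 (A) encodes to 1 byte, U+00E9 (é) to 2 bytes, U+4F60 (你) to 3 bytes and U+1F600 to 4 bytes, which is why English text stays compact in UTF-8 while Chinese text takes 3 bytes per character.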

Some things will work regardless of the encoding and type used (compare, for example). However, every function that needs to understand an individual character will be broken, i.e. the 5th character is not always the 5th entry in the underlying array. It might look like it is working with certain examples, but it will eventually break. string::compare will work, but do not expect alphabetical ordering; that is language dependent. string::find_first_of will work for some inputs but not all. Long strings will likely work just because they are long, while shorter ones might get confused by character alignment and produce very hard-to-find bugs.
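As an illustration of the find_first_of problem: it searches for any single byte from the set, so two different multi-byte characters that share a lead byte can produce a false match. Below is a rough sketch of a character-aware alternative, assuming both arguments hold valid UTF-8. The names utf8_sequence_length and find_first_char_of are invented for this example. It relies on the fact that a whole-substring search with std::string::find is safe for valid UTF-8, because a complete encoded character can only match at a character boundary.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence introduced
// by `lead`. Assumes `lead` is a valid lead byte; falls back to 1.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical character-aware counterpart of std::string::find_first_of:
// returns the byte offset of the first character in `text` that also
// occurs in `set`, or std::string::npos if there is none.
std::size_t find_first_char_of(const std::string& text, const std::string& set) {
    std::size_t best = std::string::npos;
    for (std::size_t i = 0; i < set.size();) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(set[i]));
        // Search for the whole encoded character, not for its single bytes.
        std::size_t pos = text.find(set.substr(i, len));
        best = std::min(best, pos);  // npos is the largest value, so min works
        i += len;
    }
    return best;
}
```

For example, è is encoded as C3 A8 and é as C3 A9. Searching the text "è" for the set "é" with find_first_of returns 0, because the byte C3 is shared, whereas find_first_char_of reports no match. The ordering caveat shows up just as easily: std::string("Zebra") < std::string("apple") is true, because comparison is by byte value and uppercase ASCII letters sort before lowercase ones.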

The best thing is to find a library that handles it for you and ignore the type underneath (unless you have strong reasons to pick one or the other).
