UTF-8 in C++: quick & dirty tricks


梦如初夏 2021-02-02 01:32

I am aware that there have been various questions about UTF-8, mainly about libraries for manipulating UTF-8 'string'-like objects.

However, I am working on an 'internati

3 answers

星月不相逢 (OP)

2021-02-02 02:11
Well, this dirty trick will not work. First, what is the value of mask after this:
```
   const unsigned char mask = 0x11000000;
   const unsigned char notUtf8Begin = 0x10000000;
```
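Perhaps you are mixing up hexadecimal and binary notation. 0x11000000 is a hex literal far too large for an unsigned char, so on a platform with 8-bit chars the conversion keeps only the low byte and mask ends up as 0. The bit patterns you presumably meant, 11000000 and 10000000, are 0xC0 and 0x80 in hex.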
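To make the hex/binary mix-up concrete, here is a minimal sketch. It assumes 8-bit chars and assumes the question meant the bit patterns 11000000 and 10000000. The names truncatedMask, kTopTwoBits, kContinuationTag and isContinuationByte are made up for this illustration and are not from the original code.

```cpp
// What the literal from the question turns into once it is squeezed into an
// 8-bit unsigned char: the conversion is modulo 256, so only the low byte
// (0x00) survives.
constexpr unsigned char truncatedMask = static_cast<unsigned char>(0x11000000);
static_assert(truncatedMask == 0, "the hex literal truncates to zero");

// Assumed intent: binary 11000000 and 10000000, written as hex.
constexpr unsigned char kTopTwoBits = 0xC0;       // binary 11000000
constexpr unsigned char kContinuationTag = 0x80;  // binary 10000000

// True if b is a UTF-8 continuation byte, i.e. has the form 10xxxxxx.
constexpr bool isContinuationByte(unsigned char b) {
    return (b & kTopTwoBits) == kContinuationTag;
}
```

With the masks written this way, a byte such as 0xA9 (the second byte of "é") tests as a continuation byte, while 0xC3 (its lead byte) and plain ASCII bytes do not.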

Second, as you correctly say, a character in UTF-8 encoding may be several bytes long. std::count_if visits every byte of a UTF-8 sequence, but what you actually need is to look at the leading byte of each character and skip the rest until the next character begins.

It is not hard to implement a single loop that does the counting and jumps forward, using a simple mask table for the leading bytes.

In the end you get the same O(n) cost for checking the characters, and it will work with every UTF-8 string.
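The single loop described above could look roughly like the sketch below. It is only an illustration, not the code from the question: utf8SequenceLength and countCodePoints are hypothetical names, a short chain of mask tests stands in for the lookup table, and continuation bytes are not validated. Bytes that cannot start a sequence are skipped one at a time without being counted, which is a choice made for this sketch.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte b,
// or 0 if b cannot start a sequence (continuation byte or invalid lead).
inline std::size_t utf8SequenceLength(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // 10xxxxxx or 11111xxx
}

// Counts characters (code points) by reading each lead byte and jumping
// over the remaining bytes of its sequence. Runs in O(n) in the byte count.
inline std::size_t countCodePoints(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t len =
            utf8SequenceLength(static_cast<unsigned char>(s[i]));
        if (len == 0) {  // stray byte: skip it without counting
            ++i;
            continue;
        }
        ++count;
        i += len;  // jump to the next lead byte (or past the end)
    }
    return count;
}
```

For example, countCodePoints("h\xC3\xA9llo") treats the two bytes of "é" as one character and returns 5, although the string holds 6 bytes. A sequence truncated at the end of the string is still counted once here; stricter handling would need an explicit bounds and continuation-byte check.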