C++: How to support surrogate characters in UTF-8


Question


We have an application whose base encoding is UTF-8, and it currently supports only the UTF-8 BMP range (3-byte sequences). However, there is a requirement that it support surrogate pairs.

I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

If so, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

I don't have a code snippet, as the entire application was written with UTF-8 in mind and without surrogate characters.

What would I need to change throughout the code to either support surrogate pairs in UTF-8 or switch the default encoding to UTF-16?


Answer 1:


We have an application whose base encoding is UTF-8, and it currently supports only the UTF-8 BMP range (3-byte sequences).

Why not the entire Unicode repertoire (4 bytes)? Why limit it to 3 bytes? Three bytes cover codepoints only up to U+FFFF; four bytes cover an additional 1,048,576 codepoints, all the way up to U+10FFFF.
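As a rough illustration of what the 4-byte form involves, here is a minimal sketch of a codepoint-to-UTF-8 encoder (the function name `encode_utf8` is illustrative, not from the question's codebase):

```cpp
// Minimal sketch (illustrative, not from the original code base):
// encode a single Unicode codepoint as UTF-8, including the 4-byte
// form required for codepoints above U+FFFF.
#include <stdexcept>
#include <string>

std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: U+0000..U+007F
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: U+0080..U+07FF
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: U+0800..U+FFFF
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: U+10000..U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("codepoint beyond U+10FFFF");
    }
    return out;
}
```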

However, there is a requirement that it support surrogate pairs.

Surrogate pairs only apply to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).

I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

The codepoints that are used for encoding surrogates can be physically encoded in UTF-8, however they are reserved by the Unicode standard and are illegal to use outside of UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate codepoints in it should be considered malformed.
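If your decoder wants to enforce that rule, a simple check along these lines (a sketch, not from the original answer) rejects surrogate values and out-of-range codepoints:

```cpp
// Sketch of the corresponding validity check: Unicode scalar values
// exclude the surrogate range U+D800..U+DFFF, so well-formed UTF-8
// must never encode those codepoints.
bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
}
```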

If so, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.

However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you support surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible codepoint. Unicode has many many more codepoints than that.

It sounds like your code only handles UCS-2 when converting data to/from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
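For that UCS-2-to-UTF-16 upgrade, the key change is recognizing surrogate pairs while iterating over 16-bit units. A minimal sketch, with illustrative names and assuming mostly well-formed input:

```cpp
// Walk a UTF-16 string and combine surrogate pairs into full codepoints.
// Each resulting codepoint can then be fed to a 4-byte-aware UTF-8 encoder.
#include <cstddef>
#include <string>
#include <vector>

std::vector<char32_t> utf16_to_codepoints(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        // High surrogate followed by a low surrogate -> one supplementary codepoint.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
            char32_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                ++i;
                continue;
            }
        }
        out.push_back(u); // BMP unit (an unpaired surrogate would need error handling)
    }
    return out;
}
```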




Answer 2:


We have an application whose base encoding is UTF-8, and it currently supports only the UTF-8 BMP range (3-byte sequences). However, there is a requirement that it support surrogate pairs.

So convert the UTF-16 encoded strings to UTF-8. Documentation here: http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
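A sketch of that conversion, using the `std::wstring_convert` / `std::codecvt_utf8_utf16` facilities from the linked documentation (deprecated since C++17 but still shipped by the major standard libraries); the helper names are illustrative:

```cpp
// Convert between UTF-16 and UTF-8 using the standard codecvt facet.
#include <codecvt>
#include <locale>
#include <string>

std::string utf16_to_utf8(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}

std::u16string utf8_to_utf16(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}
```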

If so, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

Wrong question. Use UTF-8 internally.

What would I need to change throughout the code to either support surrogate pairs in UTF-8 or switch the default encoding to UTF-16?

See above. Convert inbound UTF-16 data to UTF-8, and convert back to UTF-16 on the way out when necessary.
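A short usage sketch with the illustrative helpers above, using a codepoint outside the BMP such as U+1F600:

```cpp
// U+1F600 arrives as a UTF-16 surrogate pair, is stored internally as a
// 4-byte UTF-8 sequence (F0 9F 98 80), and goes back out as UTF-16.
std::u16string inbound  = u"\U0001F600";
std::string    internal = utf16_to_utf8(inbound);
std::u16string outbound = utf8_to_utf16(internal);
```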



Source: https://stackoverflow.com/questions/42556605/c-how-to-support-surrogate-characters-in-utf8
