C++ strings: UTF-8 or 16-bit encoding?

Asked by 广开言路 on 2021-02-04 09:42

I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring).

8 Answers
  •  一整个雨季
    2021-02-04 10:32

    I've actually written a widely used application (5 million+ users), so every kilobyte adds up, literally. Despite that, I just stuck with wxString. I configured it to derive from std::wstring, so I can pass wxStrings to functions expecting a std::wstring const&.

    Please note that std::wstring is native Unicode on the Mac: wchar_t is 4 bytes there, so no UTF-16 surrogates are needed for characters above U+FFFF. The big advantage of this is that i++ always gets you the next character. On Win32 that is true in only 99.9% of the cases. As a fellow programmer, you'll understand how little 99.9% is.
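    To make the 0.1% concrete, here is a minimal sketch (my illustration, not part of the original answer; countCodePoints is a hypothetical helper). With 4-byte wchar_t every element is a full code point; with 2-byte wchar_t the loop must occasionally skip the low half of a surrogate pair.

        #include <cstddef>
        #include <string>

        std::size_t countCodePoints(const std::wstring& s)
        {
            std::size_t n = 0;
            for (std::size_t i = 0; i < s.size(); ++i) {
                char32_t c = static_cast<char32_t>(s[i]);
                // Only relevant when wchar_t is 2 bytes (Win32): a high
                // surrogate means the character continues in the next unit.
                if (sizeof(wchar_t) == 2 && c >= 0xD800 && c <= 0xDBFF
                        && i + 1 < s.size())
                    ++i;
                ++n;
            }
            return n;
        }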

    But if you're not convinced, write the function that uppercases a std::string (UTF-8) and the one that uppercases a std::wstring. Those two functions will tell you which way is insanity.
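    A minimal sketch of the wide version (my addition, assuming the global locale is set appropriately; towupper is only an approximation of full Unicode case mapping, which can even change string length, e.g. German ß to SS):

        #include <algorithm>
        #include <cwctype>
        #include <string>

        std::wstring toUpper(std::wstring s)
        {
            // One locale-dependent call per element; no decoding needed.
            std::transform(s.begin(), s.end(), s.begin(), [](wchar_t c) {
                return static_cast<wchar_t>(std::towupper(static_cast<wint_t>(c)));
            });
            return s;
        }

    The UTF-8 std::string version cannot touch bytes directly: it has to decode each multi-byte sequence to a code point, case-map it, and re-encode, possibly with a different byte length.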

    Your on-disk format is another matter. For portability, that should be UTF-8: there is no endianness concern in UTF-8, nor any debate over code-unit width (2 vs. 4 bytes). This may be why many programs appear to use UTF-8.
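    Writing UTF-8 from wide strings is mechanical. A minimal sketch (again my addition; appendUtf8 is a hypothetical helper, and production code should also reject surrogate values and code points above U+10FFFF):

        #include <string>

        void appendUtf8(std::string& out, char32_t cp)
        {
            if (cp < 0x80) {                       // 1 byte: ASCII
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {               // 2 bytes
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                               // 4 bytes
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }

    The resulting byte stream reads identically on big- and little-endian machines, which is exactly why no BOM or width negotiation is needed.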

    On a slightly unrelated note, please read up on Unicode string comparison and normalization. Otherwise you'll end up with the same bug as .NET, where you can have two variables föö and föö differing only in their (invisible) normalization.
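    The bug is easy to reproduce (a standalone illustration, not taken from the answer; the strings are spelled as raw UTF-8 bytes to stay compiler-agnostic):

        #include <iostream>
        #include <string>

        int main()
        {
            // Both display as "föö": the first stores precomposed U+00F6,
            // the second stores 'o' plus combining diaeresis U+0308.
            std::string precomposed = "f\xC3\xB6\xC3\xB6";
            std::string decomposed  = "fo\xCC\x88o\xCC\x88";

            // Byte-wise they differ, so this prints 0 even though a user
            // sees identical text. Normalizing both to one form (e.g. NFC,
            // via a library such as ICU) before comparing avoids the trap.
            std::cout << (precomposed == decomposed) << '\n';
        }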
