How do I get STL std::string to work with unicode on windows?

前端 未结 9 2092
猫巷女王i
猫巷女王i 2021-02-04 13:29

At my company we have a cross platform(Linux & Windows) library that contains our own extension of the STL std::string, this class provides all sort of functionality on top

相关标签:
9条回答
  • 2021-02-04 13:53

    Yes - by being more aware of locales and encodings.

    Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.

    If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.

    Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signed-ness is left undefined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.

    Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:

    which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales

    In Linux, a wchat_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said does not conform to the above, since 2-bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)

    The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::string's, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather, I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to Window's API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.

    0 讨论(0)
  • 2021-02-04 13:53

    If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.

    0 讨论(0)
  • 2021-02-04 13:54

    There are several misconceptions in your question.

    • Neither C++ nor the STL deal with encodings.

    • std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.

    • Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.

    0 讨论(0)
  • 2021-02-04 13:59

    Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.

    0 讨论(0)
  • 2021-02-04 14:00

    No, there is no way to make Windows treat "narrow" strings as UTF-8.

    Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).

    • Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
    • In Windows portion of the code, use "wide" versions of Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This allows to avoid dependency on build system configuration.
    • In the platfrom abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).

    Other approaches that I tried but don't like much:

    • typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and std::tstring, but it still adds a lot of pain.
    • Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
    • Use ATL/WTL/MFC CString in the platfrom-specific portion; use std::string in cross-platfrom portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and thus not always acceptable or convenient.
    0 讨论(0)
  • 2021-02-04 14:09

    You should consider using QString and QByteArray, it has good unicode support

    0 讨论(0)
提交回复
热议问题