Unicode string indexing in C++

刺人心 2020-12-30 15:05

I come from Python, where you can use 'string[10]' to access a character in a sequence. If the string is encoded in Unicode, it gives me the expected results. However when

5 answers
  • 2020-12-30 15:38

    To access codepoints individually, use u32string, which represents a string as a sequence of UTF-32 code units of type char32_t.

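    u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
    char32_t c = ramp[5];    // U+0161 (š), the sixth code point
    // Neither ramp nor c can be streamed to cout as text; convert to UTF-8 first.
    cout << static_cast<unsigned long>(c) << "\n";    // prints 353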
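
    To actually print a code point taken from a u32string, it has to be converted to the encoding the terminal expects. Below is a minimal sketch, assuming a UTF-8 terminal; encode_utf8 and to_utf8 are hypothetical helper names of my own, not standard library functions.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8.
// Hypothetical helper for illustration; not part of the standard library.
std::string encode_utf8(char32_t cp)
{
    // Only Unicode scalar values are accepted (no surrogates, max U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode a whole UTF-32 string as UTF-8, one code point at a time.
std::string to_utf8(const std::u32string& s)
{
    std::string out;
    for (char32_t cp : s) out += encode_utf8(cp);
    return out;
}
```

    With these helpers, cout << encode_utf8(ramp[5]) should print š on a terminal that expects UTF-8, and cout << to_utf8(ramp) the whole string.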
    
  • 2020-12-30 15:39

    C++ has no useful native Unicode support. You almost certainly will need an external library like ICU.

  • As for what is going on, cplusplus.com makes it clear in its description of std::string:

    Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).
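
    The difference between bytes and characters is easy to see in code. This is only a sketch: count_code_points is a hypothetical helper, and it assumes the std::string holds well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is NOT
// a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

    For the two-character string "Šš" encoded as UTF-8, size() reports 4 (bytes) while count_code_points reports 2, and s[1] is the second byte of Š rather than the second character.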

    As for the solution, the other answers have it right: use ICU if you are not on C++11, and u32string if you are.

  • 2020-12-30 15:50

    Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.

    The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).

    C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more damage than good...

    Microsoft defined wchar_t as 16 bit, which was enough for the Unicode code points at that time. However, since then Unicode has been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (variable-width encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).
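
    To make "you need two of them" concrete, here is a sketch of how a code point is turned into UTF-16 code units. encode_utf16 is a hypothetical name; the arithmetic is the standard surrogate-pair scheme.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-16.
// Code points up to U+FFFF take one code unit; anything above takes a surrogate pair.
std::u16string encode_utf16(char32_t cp)
{
    // Only Unicode scalar values are accepted (no surrogates, max U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        char32_t v = cp - 0x10000;                            // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));     // high surrogate: top 10 bits
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));   // low surrogate: bottom 10 bits
    }
    return out;
}
```

    For example, U+0161 (š) fits in a single code unit, while U+1F600 becomes the pair 0xD83D 0xDE00, so indexing a 16-bit string at that position yields only half a character.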

    All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...

    C++11 made significant improvements to the subject, adding char16_t and char32_t including their associated string variants to remove the ambiguity, but still it is not fully equipped for Unicode operations.

    Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
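
    A sketch of why the one-in, one-out shape is the problem. Both functions are hypothetical and deliberately tiny: upper_one covers ASCII letters only, and upper_full adds just the single special case for U+00DF. Real full case mapping needs the Unicode special-casing data, which is what a library like ICU brings along.

```cpp
#include <string>

// One character in, one character out (the same shape as std::toupper).
// Handles ASCII letters only; everything else is returned unchanged.
char32_t upper_one(char32_t c)
{
    return (c >= U'a' && c <= U'z') ? c - (U'a' - U'A') : c;
}

// String-level mapping, which is allowed to change the length of the text.
// Only U+00DF (ß) is special-cased here, purely for illustration.
std::u32string upper_full(const std::u32string& s)
{
    std::u32string out;
    for (char32_t c : s) {
        if (c == U'\u00DF') {
            out += U"SS";          // one code point expands to two
        } else {
            out += upper_one(c);
        }
    }
    return out;
}
```

    upper_one has no way to return "SS" for ß, so it can only leave it alone; upper_full turns the three-character "Fuß" into the four-character "FUSS".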

    However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, either writing the characters as \u / \U universal character names (or octal / hexadecimal escapes) or relying on your compiler implementation to handle non-ASCII-7 source encodings appropriately.
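
    A small sketch of what the three prefixes do to the same text, written with \u / \U escapes so it does not depend on the source file encoding. code_units is a hypothetical helper that just reads the array size of the literal.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// The same two code points in each encoding:
// U+0161 (š) is inside the BMP, U+1F600 is outside it.
constexpr std::size_t utf8_units  = code_units(u8"\u0161\U0001F600");  // 2 + 4 bytes
constexpr std::size_t utf16_units = code_units(u"\u0161\U0001F600");   // 1 + 2 (surrogate pair)
constexpr std::size_t utf32_units = code_units(U"\u0161\U0001F600");   // 1 + 1
```

    Only the U"" form gives one element per code point, which is why indexing works as expected there and not in the other two.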

    And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short would suddenly be interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.

    #include <unicode/unistr.h>
    #include <unicode/ustream.h>
    #include <unicode/ustdio.h>
    
    #include <iostream>
    
    int main()
    {
        // make sure your source file is UTF-8 encoded...
        icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
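        std::cout << ramp << "\n";       // operator<< from ustream.h prints the whole string
        std::cout << ramp[5] << "\n";    // a UTF-16 code unit, printed as an integer
        u_printf( "%C\n", ramp[5] );     // ICU's printf prints it as a character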
    }
    

    Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc, this prints:

    ÐðŁłŠšÝýÞþŽž
    353
    š
    

    (1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.
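
    The code unit versus code point distinction can be shown without ICU. The sketch below uses a plain std::u16string; code_point_at is a hypothetical helper of mine, not ICU's API (as far as I recall, ICU's own code-point-aware accessors include UnicodeString::char32At and countChar32, but check the API docs).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the code point that starts at code-unit index i of a UTF-16 string.
// A high surrogate is combined with the low surrogate that follows it;
// an unpaired surrogate is returned unchanged. If i points at the second half
// of a pair, that half is returned as-is (no backing up).
char32_t code_point_at(const std::u16string& s, std::size_t i)
{
    assert(i < s.size());
    char16_t hi = s[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.size()) {
        char16_t lo = s[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            return 0x10000
                 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                 + (lo - 0xDC00);
        }
    }
    return hi;
}
```

    For the string u"a\U0001F600", s[1] is just the high surrogate 0xD83D, whereas code_point_at(s, 1) returns 0x1F600.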

  • 2020-12-30 15:50

    In my opinion, the best solution is to do any task with strings using iterators. I can't imagine a scenario where one really has to index strings: if you need indexing like ramp[5] in your example, then the 5 is usually computed in another part of the code, and usually you scan all the preceding characters anyway. That's why the Standard Library uses iterators in its API.

    A similar problem comes up if you want to get the size of a string. Should it be the character (or code point) count, or merely the number of bytes? Usually you need the size to allocate a buffer, so the byte count is more desirable. You only very, very rarely need the Unicode character count.

    If you want to process UTF-8 encoded strings using iterators then I would definitely recommend UTF8-CPP.
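
    To show the idea of scanning instead of indexing, here is a hand-rolled sketch in plain C++. It is not UTF8-CPP's API: next_code_point and nth_code_point are hypothetical names, and unlike a real library they assume well-formed UTF-8 and do no validation.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at `it` and advance `it` past it.
// Assumes well-formed UTF-8 and that `it` is not at the end.
char32_t next_code_point(std::string::const_iterator& it)
{
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;                        // single-byte (ASCII)
    // Number of continuation bytes, taken from the lead byte.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);                // payload bits of the lead byte
    while (extra-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// Return the n-th code point (0-based) by scanning from the start.
// Assumes the string contains at least n + 1 code points.
char32_t nth_code_point(const std::string& utf8, std::size_t n)
{
    auto it = utf8.begin();
    char32_t cp = 0;
    for (std::size_t i = 0; i <= n; ++i) cp = next_code_point(it);
    return cp;
}
```

    Getting the equivalent of ramp[5] this way costs a scan over the five preceding characters, which is the point made above: in UTF-8 there is no constant-time index, only iteration.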
