How to convert std::string to lower case?

旧时难觅i 2020-11-22 00:01

I want to convert a std::string to lowercase. I am aware of the function tolower(), however in the past I have had issues with this function and it

26 answers
  •  粉色の甜心
    2020-11-22 00:27

    tl;dr

    Use the ICU library. If you don't, your conversion routine will break silently on cases you probably do not even know exist.


    First you have to answer a question: What is the encoding of your std::string? Is it ISO-8859-1? Or perhaps ISO-8859-8? Or Windows Codepage 1252? Does whatever you're using to convert upper-to-lowercase know that? (Or does it fail miserably for characters over 0x7f?)

    If you are using UTF-8 (the only sane choice among the 8-bit encodings) with std::string as container, you are already deceiving yourself if you believe you are still in control of things. You are storing a multibyte character sequence in a container that is not aware of the multibyte concept, and neither are most of the operations you can perform on it! Even something as simple as .substr() could result in invalid (sub-) strings because you split in the middle of a multibyte sequence.
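    To make the .substr() problem concrete, here is a minimal sketch. The helper looks_like_utf8 is a hypothetical name, not a standard library function, and it only checks the lead-byte / continuation-byte structure (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every lead byte in `s` is followed by the
// number of continuation bytes (10xxxxxx) that it announces.
// Structural check only; not a complete UTF-8 validator.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80)                extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // lead byte of a 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // lead byte of a 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // lead byte of a 4-byte sequence
        else return false;                       // stray continuation or invalid byte

        if (s.size() - i <= extra) return false; // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char n = static_cast<unsigned char>(s[i + k]);
            if ((n & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

    With std::string s = "\xC3\x9F" "x"; (the UTF-8 bytes of "ßx"), looks_like_utf8(s) holds, but both s.substr(0, 1) and s.substr(1) fail the check, because the two-byte ß has been cut in half. std::string counts bytes, not characters, so nothing stops you from doing this.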

    As soon as you try something like std::toupper( 'ß' ) or std::tolower( 'Σ' ) in any encoding, you are in trouble, for two reasons. 1) The standard functions only ever operate on one character at a time, so they simply cannot turn ß into SS as would be correct. 2) For the same reason, they cannot decide whether Σ is in the middle of a word (where σ would be correct) or at the end (ς). Another example is std::tolower( 'I' ), which should yield different results depending on the locale: virtually everywhere you would expect i, but in Turkey ı (LATIN SMALL LETTER DOTLESS I) is the correct answer (which, again, is more than one byte in UTF-8 encoding).

    So, any case conversion that works on a character at a time, or worse, a byte at a time, is broken by design. This includes all the std:: variants in existence at this time.
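    As an illustration, this is roughly what the usual byte-at-a-time approach looks like. The name naive_lower is made up for this sketch, and the behaviour described assumes the program is still in the default "C" locale (no call to setlocale).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper showing the common std::transform + std::tolower idiom.
// It lowercases each byte independently, so it knows nothing about multibyte
// sequences and can never change the length of the string.
std::string naive_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) {
                       // The cast to unsigned char avoids undefined behaviour
                       // for negative char values.
                       return static_cast<char>(std::tolower(c));
                   });
    return s;
}
```

    In the "C" locale, naive_lower("ABC") gives "abc", but naive_lower("\xC3\x84" "B") (the UTF-8 bytes of "ÄB") gives "\xC3\x84" "b": the B is lowered while the two bytes of Ä pass through unchanged, and no error is reported. That is the silent failure the tl;dr warns about.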

    Then there is the point that the standard library, for what it is capable of doing, depends on which locales are supported on the machine your software is running on... and what do you do if your target locale is not among those supported on your client's machine?
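    You can at least detect the missing-locale situation at run time. The sketch below relies on the std::locale constructor throwing std::runtime_error for a name the system does not know; locale_available is a hypothetical helper name, and locale names such as "tr_TR.UTF-8" are platform-specific.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if the named locale can be constructed on this
// machine. Which names are accepted depends on the OS and on what has been
// installed, which is exactly the portability problem.
bool locale_available(const std::string& name)
{
    try {
        std::locale loc(name);  // throws std::runtime_error if unknown
        (void)loc;
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

    locale_available("C") should hold everywhere, whereas something like locale_available("tr_TR.UTF-8") may be true on your development machine and false on the client's. Detecting this does not solve it: you still have no standard way to get the conversion you need.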

    So what you are really looking for is a string class that is capable of dealing with all this correctly, and that is not any of the std::basic_string<> variants.

    (C++11 note: std::u16string and std::u32string are better, but still not perfect. C++20 brought std::u8string, but all these do is specify the encoding. In many other respects they still remain ignorant of Unicode mechanics, like normalization, collation, ...)

    While Boost looks nice, API-wise, Boost.Locale is basically a wrapper around ICU, provided Boost is compiled with ICU support. If it isn't, Boost.Locale is limited to the locale support compiled for the standard library.

    And believe me, getting Boost to compile with ICU can be a real pain sometimes. (There are no pre-compiled binaries for Windows that include ICU, so you'd have to supply them together with your application, and that opens a whole new can of worms...)

    So personally I would recommend getting full Unicode support straight from the horse's mouth and using the ICU library directly:

    #include <unicode/unistr.h>
    #include <unicode/ustream.h>
    #include <unicode/locid.h>
    
    #include <iostream>
    
    int main()
    {
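        /*                          "Odysseus" */
        // Note: from C++20 on, u8"" literals have type char8_t and no longer
        // convert to char const *; build as C++17 or earlier, or adapt this line.
        char const * someString = u8"ΟΔΥΣΣΕΥΣ";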
        icu::UnicodeString someUString( someString, "UTF-8" );
        // Setting the locale explicitly here for completeness.
        // Usually you would use the user-specified system locale,
        // which *does* make a difference (see ı vs. i above).
        std::cout << someUString.toLower( "el_GR" ) << "\n";
        std::cout << someUString.toUpper( "el_GR" ) << "\n";
        return 0;
    }
    

    Compile (with G++ in this example):

    g++ -Wall example.cpp -licuuc -licuio
    

    This gives:

    ὀδυσσεύς
    

    Note the Σ<->σ conversion in the middle of the word, and the Σ<->ς conversion at the end of the word. No <algorithm>-based solution can give you that.
