Convert QString into QByteArray with either UTF-8 or Latin1 encoding

春和景丽 2021-02-09 17:48

I would like to convert a QString into either a UTF-8 or a Latin-1 QByteArray, but at the moment everything comes out as UTF-8.

And I am testing this with some char in the higher segm

1 answer
  • 2021-02-09 18:29

    Things to know:

    • execution character set

    The C++ standard defines an execution character set: the encoding that string and character literals have in the binary produced by the compiler. It is described in subsection 1.1 Character sets of section 1 Overview in The C Preprocessor manual on http://gcc.gnu.org.

    Question:
    Which bytes does the string literal "\u00fc" produce?

    Answer:
    It depends on the execution character set. With gcc (which is what you're using) the default is UTF-8 unless you specify something else with the -fexec-charset option. This and other options controlling the preprocessing phase are described in subsection 3.11 Options Controlling the Preprocessor of section 3 GCC Command Options in the GCC manual on http://gcc.gnu.org. Since the execution character set is UTF-8, "\u00fc" is translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc.

    • QString::QString ( const char * str ) and QByteArray & QByteArray::append ( const QString & str ) depend on global state

    In Qt 4, the QString constructor taking a char * calls QString QString::fromAscii ( const char * str, int size = -1 ), which uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) if one has been set, and otherwise behaves like QString QString::fromLatin1 ( const char * str, int size = -1 ).

    Question:
    Which codec will the QString constructor use to decode the two-byte sequence (0xc3 0xbc) it receives?

    Answer:
    By default no codec is set with QTextCodec::setCodecForCStrings(), so Latin1 is used to decode the byte sequence. As 0xc3 and 0xbc are both valid in Latin 1, representing Ã and ¼ respectively (this should already be familiar to you from the answer to your earlier question), we get a QString with these two characters.

    • qDebug() is not 8-bit clean

    You shouldn't use the QDebug class to output anything outside ASCII; there is no guarantee about what you will get.

    Test program:

    #include <QtCore>
    
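    // Print a label for the raw input followed by the UTF-16 code units of s in hex,
    // so that only ASCII ever goes through qDebug().
    void dbg(char const * rawInput, QString s) {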
    
        QString codepoints;
        foreach(QChar chr, s) {
            codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
        }
    
        qDebug() << "Input: " << rawInput
                 << ", "
                 << "Unicode codepoints: " << codepoints;
    }
    
    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
    
        qDebug() << "system name:"
                 << QLocale::system().name();
    
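        // Runs five times although only four cases exist: the fifth pass repeats
        // the tests with the settings left by case 4 and prints no header.
        for (int i = 1; i <= 5; ++i) {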
    
            switch(i) {
    
            case 1:
                qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
                break;
            case 2:
                qDebug() << "\nWith codecForCStrings set to UTF-8\n";
                QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
                break;
            case 3:
                qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
                QTextCodec::setCodecForCStrings(0);
                QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
                break;
            case 4:
                qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
                QTextCodec::setCodecForCStrings(0);
                QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
                break;
            }
    
            qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                               ? QTextCodec::codecForCStrings()->name()
                                               : "NOT SET");
            qDebug() << "codecForLocale:"   << (QTextCodec::codecForLocale()
                                               ? QTextCodec::codecForLocale()->name()
                                               : "NOT SET");
    
            qDebug() << "\n1. Using QString::QString(char const *)";
            dbg("\\u00fc", QString("\u00fc"));
            dbg("\\xc3\\xbc", QString("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));
    
            qDebug() << "\n2. Using QString::fromUtf8(char const *)";
            dbg("\\u00fc", QString::fromUtf8("\u00fc"));
            dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));
    
            qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
            dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
            dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
            dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
        }
    
        return app.exec();
    }
    

    Output on mingw 4.4.0 on Windows XP:

    system name: "pl_PL"
    
    Without codecForCStrings (default is Latin1)
    
    codecForCStrings: "NOT SET"
    codecForLocale: "System"
    
    1. Using QString::QString(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    
    2. Using QString::fromUtf8(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    3. Using QString::fromLocal8Bit(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "102 13d "
    Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    
    With codecForCStrings set to UTF-8
    
    codecForCStrings: "UTF-8"
    codecForLocale: "System"
    
    1. Using QString::QString(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    2. Using QString::fromUtf8(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    3. Using QString::fromLocal8Bit(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "102 13d "
    Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    
    Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8
    
    codecForCStrings: "NOT SET"
    codecForLocale: "UTF-8"
    
    1. Using QString::QString(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    
    2. Using QString::fromUtf8(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    3. Using QString::fromLocal8Bit(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1
    
    codecForCStrings: "NOT SET"
    codecForLocale: "ISO-8859-1"
    
    1. Using QString::QString(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    
    2. Using QString::fromUtf8(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    3. Using QString::fromLocal8Bit(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    codecForCStrings: "NOT SET"
    codecForLocale: "ISO-8859-1"
    
    1. Using QString::QString(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    
    2. Using QString::fromUtf8(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "fc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "
    
    3. Using QString::fromLocal8Bit(char const *)
    Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
    Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
    Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
    

    I'd like to thank thiago, cbreak, peppe and heinz from the #qt IRC channel on freenode.org for pointing out the issues involved here and helping me understand them.
