I would like to covert a QString into either a utf8 or a latin1 QByteArray, but today I get everything as utf8.
And I am testing this with some char in the higher segm
Things to know:
There's something called execution character set in the C++ standard which is the term that describes what the output of string and character literals will be in the binary produced by compiler. You can read about it in the 1.1 Character sets subsection of 1 Overview section in The C Preprocessor's Manual on http://gcc.gnu.org site.
Question:
What will be produced as a result of "\u00fc"
string literal?
Answer:
It depends on what the execution character set is. In case of gcc (which is what you're using) it's by default UTF-8 unless you specify something different with -fexec-charset
option. You can read about this and other options controlling preprocessing phase in the 3.11 Options Controlling the Preprocessor subsection of 3 GCC Command Options section in GCC's Manual on http://gcc.gnu.org site. Now when we know that execution character set is UTF-8 we know that "\u00fc"
will be translated to UTF-8 encoding of U+00FC
Unicode's code point which is a sequence of two bytes 0xc3 0xbc
.
The QString's constructor taking char *
calls QString QString::fromAscii ( const char * str, int size = -1 ) which uses codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) (if any codec had been set) or does the same thing as QString QString::fromLatin1 ( const char * str, int size = -1 ) (in case no codec had been set).
Question:
What codec will be used by QString's constructor to decode two byte sequence (0xc3 0xbc
) it gets?
Answer:
By default no codec is set with QTextCodec::setCodecForCStrings()
that's why Latin1 will be used to decode byte sequence. As 0xc3
and 0xbc
are both valid in Latin 1, representing respectively à and ¼ (this should already be familiar to you as it was taken directly from this answer to your earlier question) we get QString with these two characters.
You shouldn't use QDebug
class to output anything outside of ASCII. You have no guarantee what you get.
Test program:
#include
void dbg(char const * rawInput, QString s) {
QString codepoints;
foreach(QChar chr, s) {
codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
}
qDebug() << "Input: " << rawInput
<< ", "
<< "Unicode codepoints: " << codepoints;
}
int main(int argc, char *argv[])
{
QCoreApplication app(argc, argv);
qDebug() << "system name:"
<< QLocale::system().name();
for (int i = 1; i <= 5; ++i) {
switch(i) {
case 1:
qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
break;
case 2:
qDebug() << "\nWith codecForCStrings set to UTF-8\n";
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
break;
case 3:
qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
QTextCodec::setCodecForCStrings(0);
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
break;
case 4:
qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
QTextCodec::setCodecForCStrings(0);
QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
break;
}
qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
? QTextCodec::codecForCStrings()->name()
: "NOT SET");
qDebug() << "codecForLocale:" << (QTextCodec::codecForLocale()
? QTextCodec::codecForLocale()->name()
: "NOT SET");
qDebug() << "\n1. Using QString::QString(char const *)";
dbg("\\u00fc", QString("\u00fc"));
dbg("\\xc3\\xbc", QString("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));
qDebug() << "\n2. Using QString::fromUtf8(char const *)";
dbg("\\u00fc", QString::fromUtf8("\u00fc"));
dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));
qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
}
return app.exec();
}
Output on mingw 4.4.0 on Windows XP:
system name: "pl_PL"
Without codecForCStrings (default is Latin1)
codecForCStrings: "NOT SET"
codecForLocale: "System"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
With codecForCStrings set to UTF-8
codecForCStrings: "UTF-8"
codecForLocale: "System"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8
codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
I'd like to thank thiago, cbreak, peppe and heinz from #qt freenode.org IRC channel for showing and helping me to understand issues involved here.