Character encoding has always been a notoriously thorny problem: from Python 2.7, to garbled Chinese text on Windows, to transmitting, storing, and displaying Chinese characters in MySQL, every programmer has fallen into these pits at some point.
Today let's have a proper talk about character encoding, by way of a summary. The summary aims to be concise and clear, not a long-winded treatise.
Question 1: What are UTF-8, UTF-16, Unicode, ASCII, and ANSI?
This question gets asked a lot. Rather than explain it all from scratch, I'll quote the Stack Overflow answer (https://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences) and add my own commentary after each part.
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
Unicode is essentially a logical set of code points: each code point maps to a minimal unit of some writing system, such as a Chinese character or a letter of an alphabet. This mapping is purely logical; the code points themselves are just a sequence of numbers 1, 2, 3, ...
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
The UTF family are the actual "encodings": they transform (encode) the code points of Unicode into concrete binary byte sequences for storage, i.e. the physical representation. Take the Korean character "한" as an example. Its Unicode code point is U+D55C, which is just a number: 54620 in decimal, or 1101 0101 0101 1100 in binary. All of this is logical. Encoded in UTF-8, the physical representation becomes the three bytes 11101101 10010101 10011100, which read as 237 149 156 in decimal or ED 95 9C in hex. These are the values actually stored in the computer, the true encoded values.
The common UTF-8 and UTF-16 are both encodings: a UTF-8 code point takes anywhere from 1 to 4 bytes, while UTF-16 takes 2 or 4 bytes, mostly 2.
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ASCII is the simplest encoding; not much to add.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
The term "ANSI" comes up often; it mostly refers to the local system's default code page. For Simplified Chinese systems this is GB2312 (in practice Windows uses code page 936, i.e. GBK, a superset of GB2312).
Question 2: What are string and wstring?
There is also a good answer on Stack Overflow (https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring)
To summarize: `char` and `wchar_t` have nothing to do with Unicode per se; they are just data types. `char` is one byte; `wchar_t` is 4 bytes on Linux and 2 bytes on Windows.
In practice these two types are mostly used for UTF-8 and UTF-16. As mentioned above, `char` pairs with UTF-8, where a code point occupies 1 to 4 bytes; UTF-16 pairs with `wchar_t` (on Windows), where a code point occupies 2 or 4 bytes, i.e. 1 or 2 `wchar_t` units.
Why does the distinction matter? In principle one `wchar_t` can always be viewed as several `char`s, but the grouping is the key. Take a UTF-16 encoded character stored in a single `wchar_t`: those two bytes form one indivisible code unit. If you split them into two separate bytes and decode them as UTF-8, you may well get two entirely different characters.
To convert between UTF-8 and UTF-16, you can use the codecvt facilities introduced in C++11 (deprecated since C++17, but still shipped by major standard libraries). See https://stackoverflow.com/questions/4804298/how-to-convert-wstring-into-string
Question 3: Where does garbled Chinese text come from?
First, the cause of mojibake boils down to one sentence: bytes produced by one encoding are being interpreted by a different, wrong encoding. As the saying goes, only the one who tied the bell can untie it; the decoder must match the encoder.
As for garbled Chinese specifically: there are many encodings for Chinese. China defined its own GB2312 standard, which was also adopted as an international standard, but it is still distinct from Unicode. GB2312 has its own byte layout and is incompatible with both UTF-8 and UTF-16.
GB2312 character set is sub set of Unicode character set. This means that every character defined in GB2312 is also defined in Unicode.
However, GB2312 codes and Unicode codes are totally un-related. For example, GB2312 character with code value of 0xB0A1 has a Unicode code value of 0x554A. There is no mathematical formula to convert a GB2312 code to a Unicode code of the same character.
So when UTF-8 encoded content is handed to a Windows console that decodes with GB2312, you naturally get mojibake.
Encoding and decoding happen at every stage: printing to the console (decoding), parsing a JSON string (decoding), writing a file (encoding). Whenever the two sides of any step don't match, garbage comes out.
One more point: UTF-8 and UTF-16 both encode Unicode, but `char` and `wchar_t` are not bound one-to-one to UTF; any byte stream can be held in `char` or `wchar_t`.
Content encoded in GB2312 can be stored in either `char` or `wchar_t`, but running UTF conversions on top of those bytes is your own mistake.
Source: https://www.cnblogs.com/ShaneZhang/p/12399229.html