Can we switch between ASCII and Unicode

孤街浪徒 2021-01-02 19:53

I came across "char variable is in Unicode format, but adopts / maps well to ASCII also". What is the need to mention that? Of course ASCII is 1 byte and Unicode is 2. And

2 answers
  •  隐瞒了意图╮
    2021-01-02 20:36

    Unicode is a strict superset of ASCII (and of Latin-1, for that matter), at least as far as the character set is concerned. At the byte level this does not hold for every encoding: UTF-8 encodes ASCII characters as the same single bytes, whereas UTF-16 and UTF-32 do not. So there cannot be a language or environment that supports Unicode but not ASCII. What the sentence you quote means is that if you only deal with ASCII text, everything just works, because Unicode is a superset of ASCII.
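
To make the superset relationship concrete, here is a minimal Java sketch. The class and method names (`AsciiSubsetDemo`, `sameBytesInAsciiAndUtf8`) are made up for illustration; only the standard `String.getBytes(Charset)` and `StandardCharsets` APIs are assumed. It checks whether a string encodes to the same bytes in US-ASCII and in UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubsetDemo {
    // True when the text encodes to identical bytes in US-ASCII and UTF-8.
    // Characters outside ASCII are replaced by '?' in the US-ASCII encoding,
    // so the two byte arrays differ for such text.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        char c = 'A';
        // A char holding an ASCII character has the ASCII code as its value.
        System.out.println((int) c);                               // 65
        System.out.println(sameBytesInAsciiAndUtf8("Hello"));      // true
        System.out.println(sameBytesInAsciiAndUtf8("h\u00e9llo")); // false
    }
}
```

Pure ASCII text therefore looks the same whether it is treated as ASCII or as UTF-8, which is probably all the quoted sentence is trying to say.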

    Also, to clear up a few of your misconceptions:

    1. “ASCII is 1 byte and Unicode is 2”: ASCII is a 7-bit code that is stored as 1 byte per character. Bytes and characters are therefore interchangeable in ASCII (which is unfortunate, because ideally bytes are just data and text is made of characters, but I digress). Unicode is a 21-bit code that defines a mapping of code points (numbers) to characters. How these numbers are represented depends on the encoding. UTF-32 is a fixed-width encoding in which each Unicode code point is a single 32-bit code unit. UTF-16, which Java uses, takes either two or four bytes (one or two 16-bit code units) per code point, so the 16 bits are per code unit, not per code point or per character in the Unicode sense. UTF-8 uses 8-bit code units and represents each code point as one, two, three or four of them.

    2. For Java at least, the platform has no say whatsoever in whether ASCII or Unicode is supported. Java always uses Unicode, and chars represent UTF-16 code units (which can be half-characters), not code points (which would be characters), so the type is a bit misleadingly named. What you're probably referring to is the Unix tradition of combining language, locale and preferred system encoding in a few environment variables. That is, you can have a system where the preferred encoding is a legacy encoding, and applications that blindly use it can run into problems. That doesn't mean you cannot build an application that supports Unicode on such systems. iconv has to work somehow, after all.
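
The difference between code units, code points and encoded bytes can be seen with a short Java sketch. `CodeUnitDemo` and its helper methods are hypothetical names chosen for this example; the calls they wrap (`String.length`, `String.codePointCount`, `String.getBytes(Charset)`) are standard. UTF-16BE is used rather than plain UTF-16 so that no byte order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes (8-bit code units) when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041 'A', U+00E9 'e' with acute, U+20AC euro sign,
        // U+1F600 (outside the BMP, written as a surrogate pair).
        String[] samples = {"A", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(
                "chars=" + s.length()
                + " codePoints=" + codePoints(s)
                + " utf8Bytes=" + utf8Length(s)
                + " utf16Bytes=" + utf16Length(s));
        }
    }
}
```

For the last sample, `length()` reports 2 because the single code point U+1F600 needs two `char` values (a surrogate pair), which is the "half-character" case mentioned above. Passing an explicit charset to `getBytes`, as done here, also sidesteps whatever preferred encoding the environment happens to specify.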
