Is ISO-8859-1 a Unicode charset?

|▌冷眼眸甩不掉的悲伤 提交于 2019-12-22 08:38:23

问题


I have been attending a lecture on XML where it was written "ISO-8859-1 is a Unicode format". It sounds wrong to me, but as I research on it, I struggle understanding precisely what Unicode is.

Can you call ISO-8859-1 a Unicode format ? What can you actually call Unicode ?


回答1:


No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.

Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.




回答2:


ISO 8859-1 is not Unicode

ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.

However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1 byte unsigned integers, map to ISO 8859-1.


Control characters

Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992 and the C1 controls found in positions U+0080..U+9F likewise. Wikipedia on the C0 and C1 controls suggests that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.

In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).




回答3:


ISO-8859-1 contains a subset of UTF-8 Unicode, which substantially overlaps with ASCII.

All ASCII is UTF-8 Unicode.

All the ISO 8859-1 (ISO Latin 1) characters below codes 7f hex are ASCII compatible and UTF-8 compatible in one byte. The ligatures and characters with diacritics use multi-byte Unicode UTF-8 representations, and use Unicode compatibility codepoints.

All UTF-8 single-byte character are contained in ASCII.

UTF-8 also contains multi-byte sequences, some of which are collatable (i.e. sortable) equivalents - composed equivalents - of the characters represented by compatibility codepoints, and some of which are the characters represented by all other characters sets other than ASCII and ISO Latin 1.




回答4:


No. ISO/IEC 8859-1 is older than Unicode. For example, you won't find € in it. Unicode is compatible to ISO 8859-1 up to some point. For the coding of characters in Unicode look at UCS / UTF8 / UTF16.

If you look at code formats you have something like

  • Abstract letters - The letters you are using
  • Code table - Bring the letters in some form (like alphabetic ordering)
  • Code format - Say which position in the code table is which letter, (that is the UTF8 or UTF16 encoding)
  • Code schema - If you use more words for accessing a code position, in which order are they? (Big Endian, Little Endian in UTF16) [character encoding of steering instruction (e.g. < in XML)]



回答5:


It depends on how you define "Unicode format."

I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).

In that case, no, ISO 8859-1 is not a Unicode format.

However some other definitions might be 'a character set that is a subset of the Unicode character set,' or 'an encoding that can be considered to contain Unicode data (not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.

Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.

Here you can find a more complete answer: Unicode.org



来源:https://stackoverflow.com/questions/12794825/is-iso-8859-1-a-unicode-charset

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!