How many bytes do we need to store an Arabic character?

说谎 2021-02-11 04:59

I'm a little confused about the storage needed to represent an Arabic character.

Please let me know if this is true:

  • in ISO/IEC 8859-6 encoding it takes one byte per character
2 answers
  •  野趣味
     2021-02-11 06:02

    Well first, Unicode is not an encoding. It is a standard for assigning code points to every character in every language. These code points are integers; how many bytes they take up depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.
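    To make this concrete, here is a minimal Python 3 sketch (standard library only; bytes.hex(sep) needs Python 3.8+). The code point of 'ح' is a single integer, and only the encoding step decides how many bytes it occupies:

        # A code point is one integer; its byte length depends on the encoding.
        ch = '\u062D'  # 'ح' ARABIC LETTER HAH

        print(hex(ord(ch)))                     # 0x62d -- the code point itself
        print(ch.encode('utf-8').hex(' '))      # d8 ad -- 2 bytes in UTF-8
        print(ch.encode('utf-16-be').hex(' '))  # 06 2d -- 2 bytes in UTF-16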

    To summarise:

    • ISO 8859-6 uses 1 byte for each Arabic character, but doesn't support the "Arabic Presentation Forms", nor any script other than ASCII (basic Latin) and Arabic itself.
    • UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for "Arabic presentation forms".
    • UTF-16 uses 2 bytes for each Arabic character, including "Arabic presentation forms".

    I will use two examples: 'ح' (U+062D) and 'ﻰ' (U+FEF0). Those numbers are hexadecimal codes representing the Unicode code point of each of those characters.

    In ISO 8859-6, most Arabic characters take up just a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see from the table on the Wikipedia article. The character 'ﻰ' (U+FEF0) is listed as an "Arabic Presentation Form", so I suppose that explains why it doesn't appear in ISO 8859-6 at all (you can't encode this character in that encoding).
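    A short sketch of both behaviours, using Python's bundled "iso-8859-6" codec (the exact error wording may vary between versions): the basic letter maps to the single byte CD, and the presentation form cannot be encoded at all:

        # ISO 8859-6 covers ASCII plus the basic Arabic block, one byte each.
        print('\u062D'.encode('iso-8859-6').hex())  # cd -- matches the Wikipedia table

        # Arabic Presentation Forms are outside the character set entirely.
        try:
            '\uFEF0'.encode('iso-8859-6')
        except UnicodeEncodeError as err:
            print(err)  # 'charmap' codec can't encode character '\ufef0' ...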

    There are two very common Unicode encodings which let you encode all characters: UTF-8 and UTF-16. They have slightly different strengths. UTF-8 uses one byte for ASCII characters, 2 or 3 bytes for other characters in the Basic Multilingual Plane (which includes all of Arabic), and 4 bytes for the rest. UTF-16 uses two bytes for all Basic Multilingual Plane characters and 4 bytes for the rest. So if your text is mostly ASCII, UTF-8 is smaller; UTF-16 only pulls ahead for text dominated by characters that need 3 bytes in UTF-8 (such as CJK or the presentation forms), while plain Arabic is the same size in both.
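    The size difference is easy to measure. A quick sketch with illustrative sample strings (encoding with "utf-16-be" so the byte-order mark isn't counted):

        samples = ('hello world',                       # pure ASCII
                   '\u0645\u0631\u062D\u0628\u0627',    # 'مرحبا' -- basic Arabic
                   '\uFEF0' * 4)                        # presentation forms
        for text in samples:
            u8, u16 = len(text.encode('utf-8')), len(text.encode('utf-16-be'))
            print(f'{text!r}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes')
        # 'hello world': UTF-8 = 11 bytes, UTF-16 = 22 bytes
        # 'مرحبا':       UTF-8 = 10 bytes, UTF-16 = 10 bytes
        # 'ﻰﻰﻰﻰ':        UTF-8 = 12 bytes, UTF-16 = 8 bytes

    So plain Arabic costs the same either way; UTF-16 only wins once 3-byte characters like the presentation forms dominate.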

    In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". Basically, characters from U+0080 to U+07FF use 2 bytes, and characters from U+0800 to U+FFFF use 3 bytes. So all the basic Arabic and Arabic Supplement characters use 2 bytes, whereas the Arabic Presentation Forms use 3 bytes.
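    The 2-byte/3-byte boundary sits between U+07FF and U+0800, which a quick sketch can confirm:

        for cp in (0x062D, 0x07FF, 0x0800, 0xFEF0):
            encoded = chr(cp).encode('utf-8')
            print(f'U+{cp:04X}: {encoded.hex(" ")} ({len(encoded)} bytes)')
        # U+062D: d8 ad    (2 bytes)
        # U+07FF: df bf    (2 bytes)  <- last 2-byte code point
        # U+0800: e0 a0 80 (3 bytes)  <- first 3-byte code point
        # U+FEF0: ef bb b0 (3 bytes)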

    In UTF-16, 'ح' (U+062D) and 'ﻰ' (U+FEF0) are both two bytes, as are all Arabic characters; the only complication is endianness. In the little-endian form (UTF-16LE) the bytes are just the code point's two bytes swapped around: "2D 06" for the first and "F0 FE" for the second. The big-endian form (UTF-16BE) is equally valid and gives "06 2D" and "FE F0".
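    In Python this distinction shows up as three codec names; note that the plain "utf-16" codec prepends a byte-order mark (BOM, U+FEFF) so a reader can tell which byte order follows (the third output line assumes a little-endian machine):

        ch = '\u062D'  # 'ح'
        print(ch.encode('utf-16-le').hex(' '))  # 2d 06 -- little-endian
        print(ch.encode('utf-16-be').hex(' '))  # 06 2d -- big-endian
        print(ch.encode('utf-16').hex(' '))     # ff fe 2d 06 -- BOM, then LE bytes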

    In summary, I would usually recommend UTF-8, as it is unambiguous and handles ASCII text very well. Arabic characters are 2 bytes in either encoding (unless you use "presentation forms"). You can use ISO 8859-6 if you are only ever storing ASCII and Arabic characters, and that will save you some space, but it usually isn't worth it: it will break as soon as any other character comes along. UTF-8 and UTF-16 support all characters in Unicode.
