Why does UTF-32 exist when only 21 bits are enough to encode every character?

既然无缘 2020-12-05 07:02

We know that codepoints lie in the interval 0..10FFFF, which is less than 2^21. So why do we need UTF-32 when every codepoint can be represented in 3 bytes? UTF-24 should be enough.

5 Answers
  • 2020-12-05 07:27

    It's true that only 21 bits are required (reference), but modern computers are good at moving 32-bit units around and generally interacting with them. I don't think I've ever used a programming language that had a 24-bit integer or character type, nor a platform where 24 bits was a multiple of the processor's word size (not since I last used an 8-bit computer; UTF-24 would be reasonable on an 8-bit machine), though naturally there have been some.
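
    For illustration (my own sketch, not part of the answer): C's <stdint.h> offers uint8_t, uint16_t, uint32_t and uint64_t, but no uint24_t, so a hypothetical UTF-24 code unit has to be faked with three bytes and reassembled by hand, whereas a uint32_t is a single native load.

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical 24-bit code unit: there is no uint24_t, so fake it
         * with three bytes (little-endian order chosen arbitrarily here). */
        typedef struct { uint8_t b[3]; } cu24;

        static uint32_t cu24_value(cu24 u) {
            return (uint32_t)u.b[0] | ((uint32_t)u.b[1] << 8) | ((uint32_t)u.b[2] << 16);
        }

        int main(void) {
            cu24     snowman24 = { { 0x03, 0x26, 0x00 } };  /* U+2603 SNOWMAN */
            uint32_t snowman32 = 0x2603;                    /* one native load */

            printf("sizeof(cu24) = %zu, sizeof(uint32_t) = %zu\n",
                   sizeof(cu24), sizeof(uint32_t));
            printf("decoded: U+%04" PRIX32 " vs U+%04" PRIX32 "\n",
                   cu24_value(snowman24), snowman32);
            return 0;
        }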

  • 2020-12-05 07:30

    UTF-32 code units are a whole multiple of 16 bits, and working with 32-bit quantities is much more common and usually better supported than working with 24-bit quantities. It also keeps each character 4-byte aligned (assuming the entire string is 4-byte aligned). Going from 1 byte to 2 bytes to 4 bytes is the most "logical" progression.
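
    To make the alignment point concrete (a sketch of mine, not from the answer): with 4-byte code units every element of a 4-byte-aligned array is itself 4-byte aligned, while a 3-byte stride cycles through every possible alignment, so loads routinely straddle word boundaries.

        #include <stdint.h>
        #include <stdio.h>
        #include <uchar.h>   /* char32_t (C11) */

        int main(void) {
            char32_t s[8] = U"aligned";   /* UTF-32 code units, 4 bytes each */

            for (int i = 0; i < 4; i++) {
                /* UTF-32: element address mod 4 is always 0 if the array is aligned;
                 * a hypothetical 3-byte-per-codepoint layout cycles 0, 3, 2, 1. */
                printf("element %d: UTF-32 offset mod 4 = %zu, "
                       "UTF-24 offset mod 4 = %d\n",
                       i, (size_t)((uintptr_t)&s[i] % 4), (3 * i) % 4);
            }
            return 0;
        }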

    Apart from that: The Unicode standard is ever-growing. Codepoints outside of that range could eventually be assigned (it is somewhat unlikely in the near future, however, due to the huge number of unassigned codepoints still available).

  • 2020-12-05 07:34

    First there were two character encoding schemes: UCS-4, which encoded each character in 32 bits as an unsigned integer in the range 0x00000000..0x7FFFFFFF, and UCS-2, which used 16 bits per codepoint.

    Later it turned out that the 65536 codepoints of UCS-2 would not be enough, but many programs (Windows, cough) relied on wide characters being 16 bits wide, so UTF-16 was created. UTF-16 encodes codepoints in the range U+0000..U+FFFF just like UCS-2, and codepoints in the range U+10000..U+10FFFF using surrogate pairs, i.e. pairs of 16-bit values.
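
    A sketch of that surrogate-pair arithmetic (my own illustration, not from the answer): the codepoint is reduced by 0x10000, and the remaining 20 bits are split into a 10-bit half carried by a high surrogate (U+D800..U+DBFF) and a 10-bit half carried by a low surrogate (U+DC00..U+DFFF).

        #include <assert.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Encode one codepoint in U+10000..U+10FFFF as a UTF-16 surrogate pair. */
        static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
            assert(cp >= 0x10000 && cp <= 0x10FFFF);
            cp -= 0x10000;                          /* 20 bits remain */
            *hi = 0xD800 | (uint16_t)(cp >> 10);    /* top 10 bits    */
            *lo = 0xDC00 | (uint16_t)(cp & 0x3FF);  /* bottom 10 bits */
        }

        int main(void) {
            uint16_t hi, lo;
            to_surrogates(0x1F600, &hi, &lo);       /* U+1F600 GRINNING FACE */
            printf("U+1F600 -> 0x%04X 0x%04X\n", hi, lo);   /* 0xD83D 0xDE00 */
            return 0;
        }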

    As this was a bit complicated, UTF-32 was introduced as a simple one-to-one mapping for all codepoints, including those beyond U+FFFF. Now, since UTF-16 can only encode up to U+10FFFF, it was decided that this would be the maximum value ever assigned, so that there would be no further compatibility problems; thus UTF-32 effectively needs only 21 bits. As an added bonus, UTF-8, which was initially planned as a 1-to-6-byte encoding, now never needs more than 4 bytes per codepoint, so it can easily be shown that it never requires more storage than UTF-32.
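
    The U+10FFFF ceiling falls straight out of that surrogate mechanism: two 10-bit halves give 2^20 codepoints starting at U+10000. A quick compile-time check of the arithmetic (my addition, not part of the answer):

        int main(void) {
            /* Largest codepoint reachable by a surrogate pair:
             * base 0x10000 plus 20 bits of payload (10 high + 10 low). */
            _Static_assert(0x10000 + ((0x3FFu << 10) | 0x3FFu) == 0x10FFFFu,
                           "surrogate pairs top out at U+10FFFF");
            /* ...and that value fits comfortably in 21 bits (max 0x1FFFFF). */
            _Static_assert(0x10FFFF <= (1u << 21) - 1, "fits in 21 bits");
            return 0;
        }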

    It is true that a hypothetical UTF-24 format would save memory compared to UTF-32. However, its savings would be dubious, as it would mostly consume more storage than UTF-8, except for blasts of emoji or the like, and not many interesting texts of significant length consist solely of emoji.

    However, UTF-32 is used as an in-memory representation for text in programs that need simply-indexed access to codepoints: it is the only encoding where the Nth element of a C array is also the Nth codepoint. UTF-24 would offer the same property with 25% memory savings, but at the cost of more complicated element access.
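
    A sketch of that indexing trade-off (the UTF-24 layout here is hypothetical, invented for illustration): with UTF-32 the Nth codepoint is simply s[n]; with 3-byte packing you must compute a byte offset and reassemble the value.

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        /* UTF-32: the Nth codepoint is literally the Nth array element. */
        static uint32_t nth_utf32(const uint32_t *s, size_t n) {
            return s[n];
        }

        /* Hypothetical UTF-24: 3 bytes per codepoint, little-endian in this sketch. */
        static uint32_t nth_utf24(const uint8_t *s, size_t n) {
            const uint8_t *p = s + 3 * n;
            return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
        }

        int main(void) {
            uint32_t s32[] = { 0x48, 0x1F600, 0x10FFFF };   /* 'H', emoji, max codepoint */
            uint8_t  s24[] = { 0x48,0x00,0x00, 0x00,0xF6,0x01, 0xFF,0xFF,0x10 };

            for (size_t i = 0; i < 3; i++)
                printf("U+%06" PRIX32 "  U+%06" PRIX32 "\n",
                       nth_utf32(s32, i), nth_utf24(s24, i));
            return 0;
        }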

  • 2020-12-05 07:35

    Two reasons I can think of:

    • It allows for future expansion
    • (More importantly) Computers are generally much better at dealing with data on 4-byte boundaries. The benefits in terms of reduced memory consumption are relatively small compared with the pain of working on 3-byte boundaries.

    I guess this is a bit like asking why we often have 8-bit, 16-bit, 32-bit and 64-bit integer datatypes (byte, int, long, whatever) but not 24-bit ones. I'm sure there are lots of occasions where we know that a number will never go beyond 2^21, but it's just simpler to use int than to create a 24-bit type.
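
    In code, that trade-off typically looks like this (illustrative sketch, the names are invented): you know the value fits in 21 bits, you document the bound with a check, and you store it in a plain 32-bit type anyway.

        #include <stdbool.h>
        #include <stdint.h>

        /* A codepoint needs at most 21 bits, but the next convenient
         * native width is 32 bits, so just use that. */
        typedef uint32_t codepoint;

        static bool is_valid_codepoint(codepoint cp) {
            return cp <= 0x10FFFF;      /* well inside 2^21 - 1 = 0x1FFFFF */
        }

        int main(void) {
            return is_valid_codepoint(0x1F600) ? 0 : 1;
        }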

  • 2020-12-05 07:44

    UTF-24 has no added value.

    • If space matters, UTF-8 encodes every codepoint up to U+FFFF in 3 bytes or fewer (and most text needs only 1 or 2 bytes per codepoint); only U+10000..U+10FFFF require 4 bytes. So for realistic text UTF-8 is more compact than a flat 3-byte UTF-24 (a rough byte-count sketch follows this list).

    • If space doesn't matter, UTF-32 is faster than UTF-24, because computers work better with power-of-2 aligned data.
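
    A rough byte-count sketch for a short mixed string (my own illustration; utf8_len is a hand-written helper, not a library function), comparing UTF-8 with a flat 3-bytes-per-codepoint UTF-24 and with UTF-32:

        #include <stdint.h>
        #include <stdio.h>

        /* Bytes UTF-8 needs for one codepoint: 1, 2, 3 or 4, never more. */
        static size_t utf8_len(uint32_t cp) {
            if (cp < 0x80)    return 1;
            if (cp < 0x800)   return 2;
            if (cp < 0x10000) return 3;
            return 4;                   /* U+10000..U+10FFFF */
        }

        int main(void) {
            /* "Hi, " + U+00E9 (é) + U+4E2D (中) + U+1F600 (emoji) */
            uint32_t text[] = { 'H', 'i', ',', ' ', 0xE9, 0x4E2D, 0x1F600 };
            size_t n = sizeof text / sizeof text[0];

            size_t u8 = 0;
            for (size_t i = 0; i < n; i++) u8 += utf8_len(text[i]);

            printf("codepoints: %zu\n", n);
            printf("UTF-8 : %zu bytes\n", u8);        /* 4*1 + 2 + 3 + 4 = 13 */
            printf("UTF-24: %zu bytes\n", 3 * n);     /* flat 3 each = 21     */
            printf("UTF-32: %zu bytes\n", 4 * n);     /* flat 4 each = 28     */
            return 0;
        }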
