I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1,2,3 or 4 bytes. I also don\'t understand why
Unicode supports 1,114,112 code points. There are 2048 surrogate code point, giving 1,112,064 scalar values. Of these, there are 66 non-characters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for restrictions on the continuation bytes are presumably so it is easy to find the beginning of the next character (as continuation characters are always of the form 10xxxxxx, but the starting byte can never be of this form).
Unicode has the hexadecimal amount of 110000, which is 1114112
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
To give a metaphorically accurate answer, all of them
.
Continuation bytes in the UTF-8 encodings allow for resynchronization of the encoded octet stream in the face of "line noise". The encoder, merely need scan forward for a byte that does not have a value between 0x80 and 0xBF to know that the next byte is the start of a new character point.
In theory, the encodings used today allow for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal length tweet can encode up to 4,340 bits' worth of data. (140 characters [valid and invalid], times 31 bits each.)
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß
is represented as the byte sequence 81 30 89 38
, which contains the encoding of the digits 0
and 8
. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8
will find a false positive within the letter ß
.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.