Question
I'm a bit confused about how Unicode code points are converted to UTF-16, and I'm looking for someone who can explain it in the simplest way possible.
For a character like "𐒌" we get:
d801dc8c --> UTF-16
0001048c --> UTF-32
f090928c --> UTF-8
66700 --> Decimal Value
So the UTF-16 hexadecimal value converts to "11011000 00000001 11011100 10001100", which is "3624000652" in decimal. My question is: how do we get this hexadecimal value, and how can we convert it back to the real code point, "66700"?
The UTF-32 hexadecimal value converts to "00000000 00000001 00000100 10001100", which is "66700" in decimal, but the UTF-16 value doesn't convert back to "66700"; instead we get "3624000652".
How does the conversion actually happen?
For UTF-8, the 4-byte encoding goes like 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
But how does this work in UTF-16? If anyone can explain it in the simplest possible way, that would be a huge help; I've been searching for the past few days and haven't found an answer that makes sense to me.
The websites I used for the conversions were Branah.com and rapidtables.com.
Answer 1:
"how do we got this value"
"how can we convert it back to the real codepoint"
"about surrogate pairs, how they work?"
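As a rough orientation, the UTF-16 counterpart of the UTF-8 bit template quoted in the question is 110110yyyyyyyyyy 110111yyyyyyyyyy, where the twenty y bits hold the code point minus 0x10000 (not the code point itself). The sketch below is only an illustration: it assumes the core Encode module is available and prints both bit layouts for the character from the question, so the fixed prefixes can be seen next to the payload bits.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr 0x1048C;    # the character from the question

# UTF-8 template:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
my $utf8_bits = join ' ',
    map { sprintf('%08b', $_) } unpack 'C*', encode('UTF-8', $char);

# UTF-16 template: 110110yyyyyyyyyy 110111yyyyyyyyyy
# 'n*' reads the bytes as big-endian 16-bit code units.
my $utf16_bits = join ' ',
    map { sprintf('%016b', $_) } unpack 'n*', encode('UTF-16BE', $char);

print "UTF-8 : $utf8_bits\n";     # 11110000 10010000 10010010 10001100
print "UTF-16: $utf16_bits\n";    # 1101100000000001 1101110010001100
```

The first six bits of each UTF-16 unit (110110 and 110111) are the surrogate markers; the remaining ten bits of each unit are the payload.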
Study the algorithm for encoding to UTF-16:
my $U = 66_700; # code point
if ($U > 0xffff) {
my $U_prime = $U - 0x1_0000; # some intermediate value 0x0_0000 .. 0xF_FFFF
sprintf '%d', $U_prime; # 1164
sprintf '0x%04X', $U_prime; # 0x048C
sprintf '0b%020b', $U_prime; # 0b00000000010010001100
my $high_ten_bits = $U_prime >> 10; # shift right to keep the top ten bits, range 0x000 .. 0x3FF
sprintf '0b%010b', $high_ten_bits; # 0b0000000001
my $low_ten_bits = $U_prime % 2**10; # remainder keeps the bottom ten bits, range 0x000 .. 0x3FF
sprintf '0b%010b', $low_ten_bits; # 0b0010001100
my $W1 = $high_ten_bits + 0xD800; # high surrogate
sprintf '%d', $W1; # 55297
sprintf '0x%04X', $W1; # 0xD801
sprintf '0b%016b', $W1; # 0b1101100000000001
my $W2 = $low_ten_bits + 0xDC00; # low surrogate
sprintf '%d', $W2; # 56460
sprintf '0x%04X', $W2; # 0xDC8C
sprintf '0b%016b', $W2; # 0b1101110010001100
# finally emit the concatenation of W1 and W2
# your original arithmetic checks out:
($W1 << 16) + $W2 # 3624000652
}
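The same encoding steps can be condensed into one function. This is a minimal sketch, not a reference implementation: `utf16_code_units` is a hypothetical helper name, and the cross-check at the end assumes the core Encode module with the big-endian, BOM-less `UTF-16BE` encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the UTF-16 code unit(s) for one code point.
sub utf16_code_units {
    my ($cp) = @_;
    die "not a Unicode scalar value\n"
        if $cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);
    return ($cp) if $cp <= 0xFFFF;      # BMP: a single unit, the code point itself
    my $v = $cp - 0x1_0000;             # 20-bit intermediate value
    return (0xD800 + ($v >> 10),        # high surrogate: top ten bits
            0xDC00 + ($v & 0x3FF));     # low surrogate: bottom ten bits
}

my @units = utf16_code_units(66_700);
printf "%04X %04X\n", @units;           # D801 DC8C

# Cross-check against Encode: the same four bytes, as hex.
my $hex = unpack 'H*', encode('UTF-16BE', chr 66_700);
print "$hex\n";                         # d801dc8c
```

Code points up to 0xFFFF need no arithmetic at all, which is why the subtraction and the surrogate offsets only appear in the second branch.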
Reverse direction:
my @octets = (0xD8, 0x01, 0xDC, 0x8C);
my $W1 = ($octets[0] << 8) + $octets[1];
sprintf '%d', $W1; # 55297
sprintf '0x%04X', $W1; # 0xD801
sprintf '0b%016b', $W1; # 0b1101100000000001
my $W2 = ($octets[2] << 8) + $octets[3];
sprintf '%d', $W2; # 56460
sprintf '0x%04X', $W2; # 0xDC8C
sprintf '0b%016b', $W2; # 0b1101110010001100
my $high_ten_bits = $W1 - 0xD800;
sprintf '0b%010b', $high_ten_bits; # 0b0000000001
my $low_ten_bits = $W2 - 0xDC00;
sprintf '0b%010b', $low_ten_bits; # 0b0010001100
my $U_prime = ($high_ten_bits << 10) + $low_ten_bits;
sprintf '%d', $U_prime; # 1164
sprintf '0x%04X', $U_prime; # 0x048C
sprintf '0b%020b', $U_prime; # 0b00000000010010001100
my $U = $U_prime + 0x1_0000;
sprintf '%d', $U; # 66700
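The reverse direction can likewise be wrapped in a function that walks a list of 16-bit code units. This is a sketch under stated assumptions: `utf16_code_points` is a hypothetical helper name, the input is taken to be big-endian without a byte order mark, and unpaired surrogates are simply rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: turns a list of UTF-16 code units into code points.
sub utf16_code_points {
    my @units = @_;
    my @cps;
    while (@units) {
        my $w1 = shift @units;
        if ($w1 >= 0xD800 && $w1 <= 0xDBFF) {           # high surrogate
            my $w2 = shift @units;
            die "unpaired high surrogate\n"
                unless defined $w2 && $w2 >= 0xDC00 && $w2 <= 0xDFFF;
            # Rebuild the 20-bit value, then add the offset back.
            push @cps, 0x1_0000 + (($w1 - 0xD800) << 10) + ($w2 - 0xDC00);
        }
        elsif ($w1 >= 0xDC00 && $w1 <= 0xDFFF) {        # low surrogate on its own
            die "unpaired low surrogate\n";
        }
        else {
            push @cps, $w1;                             # BMP: unit is the code point
        }
    }
    return @cps;
}

# 'n*' reads the raw bytes as big-endian 16-bit units.
my @units = unpack 'n*', pack('H*', 'd801dc8c');
print join(' ', utf16_code_points(@units)), "\n";       # 66700
```

This also shows why reading d801dc8c as one 32-bit number gives 3624000652: the four bytes are two separate 16-bit units, and each has to have its surrogate offset removed before the halves are joined.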
Source: https://stackoverflow.com/questions/58207814/how-utf-16-and-utf-8-conversion-happen