How do UTF-16 and UTF-8 conversions happen?

Question

I'm confused about how Unicode code points are converted to UTF-16, and I'm looking for someone who can explain it in the simplest way possible.

For a character like "𐒌" we get:

d801dc8c -->  UTF-16
0001048c -->  UTF-32
f090928c -->  UTF-8
66700    -->  Decimal Value

So the UTF-16 hexadecimal value converts to "11011000 00000001 11011100 10001100" in binary, which is "3624000652" in decimal. My question is: how do we get this hexadecimal value, and how can we convert it back to the real code point, "66700"?

The UTF-32 hexadecimal value converts to "00000000 00000001 00000100 10001100", which is "66700" in decimal, but the UTF-16 value doesn't convert back to "66700"; instead we get "3624000652".
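The mismatch can be reproduced with a minimal Python sketch. Python is used here only for illustration, and the variable names are hypothetical. It assumes the converter printed the UTF-16 bytes in big-endian order, and it contrasts reading those four bytes as one integer with decoding them as UTF-16.

```python
# Hex strings copied from the converter output above.
utf16_hex = "d801dc8c"
utf32_hex = "0001048c"

# Reading all four UTF-16 bytes as one big-endian integer gives the large number.
as_single_integer = int(utf16_hex, 16)

# Decoding the same bytes as big-endian UTF-16 yields a single character instead.
char = bytes.fromhex(utf16_hex).decode("utf-16-be")

print(as_single_integer)   # 3624000652
print(ord(char))           # 66700
print(int(utf32_hex, 16))  # 66700
```

The large number is therefore just two 16-bit code units glued together, not a code point in its own right.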

How does the conversion actually happen?

For example, the 4-byte UTF-8 encoding follows the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
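To make the bit pattern concrete, here is a small Python sketch that fills the 21 "x" bits from the code point. The function name `encode_utf8_4byte` is made up for this illustration, and it deliberately handles only code points that need four bytes.

```python
def encode_utf8_4byte(code_point: int) -> bytes:
    """Pack a code point in U+10000..U+10FFFF into the 4-byte UTF-8 layout."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points that need four bytes are handled")
    return bytes([
        0xF0 | (code_point >> 18),           # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


encoded = encode_utf8_4byte(66700)
print(encoded.hex())  # f090928c
```

The result matches the UTF-8 value listed above and agrees with Python's built-in `chr(66700).encode("utf-8")`.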

But how does this happen in UTF-16? If anyone can explain it in the simplest possible way, that would be a huge help, because I've been searching for the past few days and haven't found a good answer that makes sense to me.

The websites I used for the conversions were Branah.com and rapidtables.com.

Answer 1:

How do we get this value?

How can we convert it back to the real code point?

How do surrogate pairs work?

Study the algorithm for encoding to UTF-16:

my $U = 66_700; # code point
if ($U > 0xffff) {
    my $U_prime = $U - 0x1_0000; # some intermediate value 0x0_0000 .. 0xF_FFFF
    sprintf '%d', $U_prime;      # 1164
    sprintf '0x%04X', $U_prime;  # 0x048C
    sprintf '0b%020b', $U_prime; # 0b00000000010010001100

    my $high_ten_bits = $U_prime >> 10;  # range 0x000 .. 0x3FF
    sprintf '0b%010b', $high_ten_bits;   # 0b0000000001

    my $low_ten_bits = $U_prime & 0x3FF; # range 0x000 .. 0x3FF
    sprintf '0b%010b', $low_ten_bits;    # 0b0010001100

    my $W1 = $high_ten_bits + 0xD800; # high surrogate
    sprintf '%d', $W1;      # 55297
    sprintf '0x%04X', $W1;  # 0xD801
    sprintf '0b%016b', $W1; # 0b1101100000000001

    my $W2 = $low_ten_bits + 0xDC00;  # low surrogate
    sprintf '%d', $W2;      # 56460
    sprintf '0x%04X', $W2;  # 0xDC8C
    sprintf '0b%016b', $W2; # 0b1101110010001100

    # finally emit the concatenation of W1 and W2

    # your original arithmetic checks out:
    ($W1 << 16) + $W2   # 3624000652
}

Reverse direction:

my @octets = (0xD8, 0x01, 0xDC, 0x8C);
my $W1 = ($octets[0] << 8) + $octets[1];
sprintf '%d', $W1;      # 55297
sprintf '0x%04X', $W1;  # 0xD801
sprintf '0b%016b', $W1; # 0b1101100000000001

my $W2 = ($octets[2] << 8) + $octets[3];
sprintf '%d', $W2;      # 56460
sprintf '0x%04X', $W2;  # 0xDC8C
sprintf '0b%016b', $W2; # 0b1101110010001100

my $high_ten_bits = $W1 - 0xD800;
sprintf '0b%010b', $high_ten_bits; # 0b0000000001

my $low_ten_bits = $W2 - 0xDC00;
sprintf '0b%010b', $low_ten_bits;  # 0b0010001100

my $U_prime = ($high_ten_bits << 10) + $low_ten_bits;
sprintf '%d', $U_prime;      # 1164
sprintf '0x%04X', $U_prime;  # 0x048C
sprintf '0b%020b', $U_prime; # 0b00000000010010001100

my $U = $U_prime + 0x1_0000;
sprintf '%d', $U; # 66700
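For comparison, both directions can be condensed into a pair of small functions. This is a Python sketch rather than part of the Perl walkthrough above; the names `to_surrogates` and `from_surrogates` are hypothetical, and the range checks are an added assumption about how invalid input should be treated.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit intermediate value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(66700)
print(hex(high), hex(low))         # 0xd801 0xdc8c
print(from_surrogates(high, low))  # 66700
```

Masking with `0x3FF` keeps exactly the low ten bits, and shifting right by 10 keeps the high ten bits, which is why the two halves can be recombined without loss.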

Source: https://stackoverflow.com/questions/58207814/how-utf-16-and-utf-8-conversion-happen

Tags

unicode

encoding

utf-8

utf-16