Question
The following byte stream is identified as UTF-8; it contains the Hebrew sentence: דירות לשותפים בתל אביב - הומלס. I'm trying to understand the encoding.
ubuntu@ip-10-126-21-104:~$ od -t x1 homeless-title-fromwireshark_followed_by_hexdump.txt
0000000 0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7
0000020 a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7
0000040 aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7
0000060 94 d7 95 d7 9e d7 9c d7 a1 0a
0000072
ubuntu@ip-10-126-21-104:~$ file -i homeless-title-fromwireshark_followed_by_hexdump.txt
homeless-title-fromwireshark_followed_by_hexdump.txt: text/plain; charset=utf-8
The file is UTF-8; I've verified this by opening Notepad (Windows 7), entering the Hebrew character ד, and then saving the file, which yields the following:
ubuntu@ip-10-126-21-104:~$ od -t x1 test_from_notepad_utf8_daled.txt
0000000 ef bb bf d7 93
0000005
ubuntu@ip-10-126-21-104:~$ file -i test_from_notepad_utf8_daled.txt
test_from_notepad_utf8_daled.txt: text/plain; charset=utf-8
Here ef bb bf is the BOM in its UTF-8 form, and d7 93 is exactly the sequence of bytes that appears in the original stream after 0a 09 (newline and tab in ASCII).
The problem here is that, according to the Unicode code charts, ד should be coded as 05 D3, so why and how did the UTF-8 encoding come to be d7 93?
d7 93 in binary is 11010111 10010011, while 05 D3 in binary is 00000101 11010011. I can't seem to find a transformation between these two encodings that makes sense, given that (to my understanding) they represent the same Unicode entity, "HEBREW LETTER DALET".
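(For reference, both observations are easy to reproduce in a Python 3 shell; this is a quick sanity check, not part of the original capture:)

# Check the code point and the UTF-8 bytes for HEBREW LETTER DALET.
daled = "\u05D3"
print(hex(ord(daled)))              # 0x5d3 -- the Unicode code point
print(daled.encode("utf-8").hex())  # d793  -- the bytes seen in the stream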
Thank you,
Maxim.
Answer 1:
Unicode code points U+0000..U+007F are encoded in UTF-8 as a single byte 0x00..0x7F.
Unicode code points U+0080..U+07FF (including HEBREW LETTER DALET, U+05D3) are encoded in UTF-8 as two bytes. The binary value of such a code point can be split into a group of 5 bits and a group of 6 bits, as in xxxxxyyyyyy. The first byte of the UTF-8 representation has the bit pattern 110xxxxx; the second has the bit pattern 10yyyyyy.
0x05D3 = 0000 0101 1101 0011
The last 6 bits of 0x05D3 are 010011; prefixed by the 10, that gives 1001 0011 or 0x93. The previous 5 bits are 10111; prefixed by the 110, that gives 1101 0111 or 0xD7.
Hence, the UTF-8 encoding for U+05D3 is 0xD7 0x93.
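As a sketch of that bit manipulation in plain Python 3 (the variable names are my own, not from any library):

# Two-byte UTF-8 encoding for a code point in U+0080..U+07FF,
# following the xxxxx / yyyyyy split described above.
cp = 0x05D3                       # HEBREW LETTER DALET
byte1 = 0b11000000 | (cp >> 6)    # 110xxxxx <- top 5 bits
byte2 = 0b10000000 | (cp & 0x3F)  # 10yyyyyy <- low 6 bits
print(hex(byte1), hex(byte2))     # 0xd7 0x93
assert bytes([byte1, byte2]) == "\u05D3".encode("utf-8")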
There are further rules for Unicode code points from U+0800 upward, which require 3 or 4 (but never more) bytes for the UTF-8 representation. The continuation bytes always have the 10yyyyyy bit pattern. The first byte has the bit pattern 1110xxxx (3-byte sequences) or 11110xxx (4-byte sequences). A number of byte values can never appear in valid UTF-8: 0xC0, 0xC1, and 0xF5..0xFF.
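To see how the sequence length grows with the code point, one can lean on Python's built-in encoder (the sample characters below are arbitrary picks from each range):

# 1-, 2-, 3- and 4-byte UTF-8 sequences: U+0041, U+05D3, U+20AC, U+1F600.
for ch in ("A", "\u05D3", "\u20AC", "\U0001F600"):
    utf8 = ch.encode("utf-8")
    hexed = " ".join(f"{b:02x}" for b in utf8)
    print(f"U+{ord(ch):04X} -> {hexed} ({len(utf8)} bytes)")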
Answer 2:
Unicode defines (among other things) a bunch of "code points" and gives each one a numerical value. The value for HEBREW LETTER DALET is U+05D3, i.e. 0x05D3. But that's just a number, and it DOES NOT tell you how to "encode" the code point (i.e. the actual bits) in a file or in memory. UTF-8 (as well as UTF-16, UTF-32, and a variety of other schemes) tells you how to do that.
There is actually a formulaic way of translating Unicode code points into UTF-8 byte sequences (but that's a whole different SO question). It turns out that in UTF-8, HEBREW LETTER DALET is encoded as 0xD7 0x93. By the way, if you find a text editor that allows you to save as UTF-32 or UCS-4, you will find that (in addition to a much larger file) the bytes you see with a hex editor match the code points from the Unicode spec.
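For example, in Python 3 (independent of any particular editor), encoding the character as big-endian UTF-32 without a BOM makes the code point visible directly in the bytes:

# UTF-32 (big-endian, no BOM): each code point is a 4-byte integer,
# so the bytes literally spell out U+05D3.
print("\u05D3".encode("utf-32-be").hex())  # 000005d3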
This page may give a little extra information on some of the representations for that one character.
For a great introduction to Unicode, I would suggest Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Answer 3:
Legacy codepages defined a set of characters and their mapping to byte sequences. Unicode separates the concepts of character set and character encoding.
So, the Unicode character set is a list of code points, each assigned a unique value as an identifier: ד is U+05D3.
The encodings - Unicode transformation formats - describe how to encode each code point as a sequence of code units.
UTF-8 uses a 1-octet code unit and code points are encoded as sequences of between one and four bytes. The algorithm is described in RFC 3629.
A similar procedure exists for UTF-16, which uses 2-octet code units; each code point is encoded as two or four bytes. And there isn't anything to do for UTF-32 except make every value four bytes long. These encodings can come in big- or little-endian forms, so U+05D3 might be 00 00 05 D3 or D3 05 00 00 in UTF-32. The BOM is often used to tell which encoding is being used, and what the endianness is, when the encoding of the data is otherwise ambiguous.
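A short demonstration of the endianness variants and the BOM, using the Python 3 standard codecs (the byte order chosen by the plain "utf-16" codec is platform-dependent; the output shown is for a little-endian machine):

daled = "\u05D3"
for codec in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "utf-16"):
    encoded = daled.encode(codec)
    print(codec, "->", " ".join(f"{b:02x}" for b in encoded))
# utf-16-be -> 05 d3
# utf-16-le -> d3 05
# utf-32-be -> 00 00 05 d3
# utf-32-le -> d3 05 00 00
# utf-16    -> ff fe d3 05   (BOM first, then little-endian data)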
There's also UTF-7, but I've never seen it in the wild.
Source: https://stackoverflow.com/questions/6153883/how-is-this-octet-stream-being-interpreted-as-hebrew-utf-8-encoding