How does decoding in UTF-8 know the byte boundaries?

Submitted by 点点圈 on 2021-01-21 07:53:04

Question


I've been doing a bunch of reading on unicode encodings, especially with regards to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.

How does the decoding know the byte boundaries? For example, say I have a Unicode string containing two characters whose UTF-8 byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as UTF-8), which leads me to my main question.

How does the decoding know to interpret the bytes \xc6\xb4 and not \xc6\xb4\xe2?
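For concreteness, here is a small Python sketch of the scenario described above (the filename sample.txt is just an illustration): write those five bytes out, let Python's UTF-8 decoder read them back, and it comes up with exactly two characters.

    # Write the raw UTF-8 bytes from the question to a file, then read them back.
    data = b"\xc6\xb4\xe2\x98\x82"  # 'ƴ' (U+01B4) followed by '☂' (U+2602)

    with open("sample.txt", "wb") as f:
        f.write(data)

    with open("sample.txt", encoding="utf-8") as f:
        text = f.read()

    print(text)       # ƴ☂
    print(len(text))  # 2 -- the decoder found two characters, not five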


Answer 1:


The byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, and \xe2 starts with 1110. In UTF-8 (and I'm pretty sure this is not an accident), you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits at the start before the first 0. So your first character has 2 bytes and the second one has 3 bytes.

If a byte starts with 0, it is a regular ASCII character.

If a byte starts with 10, it is a continuation byte: part of a multi-byte UTF-8 sequence, but never the lead byte.
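
To make the rule concrete, here is a rough Python sketch (a hypothetical helper, not the codec Python actually uses) that determines a sequence's length from its lead byte alone and then walks the bytes from the question:

    # Rough sketch: derive the sequence length by counting the leading 1 bits
    # of the first byte, then slice the input accordingly.
    def utf8_sequence_length(lead_byte: int) -> int:
        if lead_byte < 0x80:        # 0xxxxxxx -> single-byte ASCII character
            return 1
        if lead_byte >= 0xC0:       # 110xxxxx / 1110xxxx / 11110xxx -> lead byte
            length = 0
            while lead_byte & 0x80:             # count leading 1 bits
                length += 1
                lead_byte = (lead_byte << 1) & 0xFF
            return length
        raise ValueError("0x%02X is a continuation byte (10xxxxxx), not a lead byte" % lead_byte)

    data = b"\xc6\xb4\xe2\x98\x82"
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        print(data[i:i + n].decode("utf-8"), "uses", n, "bytes")
        i += n
    # ƴ uses 2 bytes
    # ☂ uses 3 bytes

This is also why UTF-8 is self-synchronizing: because every continuation byte starts with 10, a decoder that lands in the middle of a character can resynchronize by skipping forward to the next byte that does not start with 10.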



Source: https://stackoverflow.com/questions/24113496/how-does-decoding-in-utf-8-know-the-byte-boundaries
