Question
I have the character "ö". If I look in this UTF-8 table I see it has the hex value F6. If I look in the Unicode table I see that "ö" has the indices E0 and 16. If I add both I get F6 as the hex value of the code point. This is the binary value 1111 0110.
1) How do I get from the hex value F6 to the indices E0 and 16?
2) I don't know how to get from F6 to the two bytes C3 B6...
Because I didn't get the results I expected, I tried going the other way. The UTF-8 encoding of "ö", interpreted as ISO-8859-1, is displayed as "Ã¶". In the UTF-8 table I can see that "Ã" has the decimal value 195 and "¶" has the decimal value 182. Converted to bits this is 1100 0011 1011 0110.
Process:
1. Look in a table and get the Unicode value for the character "ö". Calculated from the indices E0 and 16 you get the code point U+00F6.
2. According to the algorithm posted by wildplasser you can calculate the UTF-8 encoded values C3 and B6.
3. In binary form this is 1100 0011 1011 0110, which corresponds to the decimal values 195 and 182.
4. If these values are interpreted as ISO 8859-1 (one byte per character), you get "Ã¶".
PS: I also found this link, which shows the values from step 2.
Answer 1:
The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" is giving you the value of the code point in UTF-8. They are both simply listing the Unicode value of the characters.
In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal and 246 in decimal.
UTF-8 is a representation of Unicode, using a sequence of between one and four bytes per Unicode code point. The transformation from 32-bit Unicode code points to UTF-8 byte sequences is described in that article - it is pretty simple to do, once you get used to it. Of course, computers do it all the time, but you can do it with a pencil and paper easily, and in your head with a bit of practice.
If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary, which is why that is the UTF-8 representation of ö.
The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.
Once you have converted between UTF-8 and standard Unicode code points (UTF-32), it should be trivial to get the Latin-1 encoding. However, not all UTF-8 sequences / Unicode characters have corresponding Latin-1 characters.
See the excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a better understanding of character encodings and transformations between them.
Answer 2:
unsigned cha_latin2utf8(unsigned char *dst, unsigned cha)
{
    if (cha < 0x80) { *dst = cha; return 1; }
    /* all 11-bit codepoints (0x0 -- 0x7ff)
    ** fit within a 2-byte utf8 char
    ** firstbyte = 110 +xxxxx := 0xc0 + (char>>6) MSB
    ** second    = 10 +xxxxxx := 0x80 + (char& 63) LSB
    */
    *dst++ = 0xc0 | ((cha >> 6) & 0x1f); /* 2+1+5 bits */
    *dst++ = 0x80 | (cha & 0x3f);        /* 1+1+6 bits */
    return 2; /* number of bytes produced */
}
To test it:
#include <stdio.h>
int main (void)
{
    unsigned char buff[12];
    cha_latin2utf8(buff, 0xf6);
    fprintf(stdout, "%02x %02x\n", buff[0], buff[1]);
    return 0;
}
The result:
c3 b6
Source: https://stackoverflow.com/questions/7903684/convert-character-from-utf-8-to-iso-8859-1-manually