Convert character from UTF-8 to ISO-8859-1 manually

笑着哭i submitted on 2019-12-01 10:39:26

The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" gives you the UTF-8 encoding of the code point. Both simply list the Unicode value of each character.

In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal, and 246 in decimal.

UTF-8 is a representation of Unicode that uses a sequence of one to four bytes per Unicode code point. The transformation from Unicode code points to UTF-8 byte sequences is described in that article; it is pretty simple to do once you get used to it. Of course, computers do it all the time, but you can do it easily with pencil and paper, and in your head with a bit of practice.

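If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary, which is why that is the UTF-8 representation of ö. In detail: F6 padded to 11 bits is 00011 110110; the upper five bits go after the prefix 110 (giving 1100 0011 = C3) and the lower six bits go after the prefix 10 (giving 1011 0110 = B6).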
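The transformation described above can be sketched in C roughly as follows. This is a minimal illustration, not a reference implementation: the function name utf8_encode is hypothetical, and the caller is assumed to supply a buffer with room for four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into dst (assumed to have room for 4) and
 * returns the number of bytes written, or 0 if cp is not encodable
 * (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *dst)
{
    if (cp < 0x80) {                     /* 0xxxxxxx */
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)    /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xF0 | (cp >> 18));
        dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                            /* beyond the Unicode range */
}
```

Calling utf8_encode(0xF6, buf) fills buf with C3 B6 and returns 2, matching the hand calculation for ö.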

The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.

Once you have converted from UTF-8 to standard Unicode code points (UTF-32), it should be trivial to get the Latin-1 encoding: the code point value is the Latin-1 byte. However, not all UTF-8 sequences / Unicode characters have corresponding Latin-1 characters; only code points U+0000 through U+00FF do.
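For the UTF-8 to Latin-1 direction, a rough sketch in C could look like the one below. The function name utf8_to_latin1 is hypothetical, and substituting '?' for anything without a Latin-1 equivalent is an assumption made for illustration; a stricter converter might report an error instead. It relies on the fact that U+0080 through U+00FF always encode as a lead byte of C2 or C3 followed by one continuation byte.

```c
#include <stddef.h>

/* Hypothetical helper: convert a UTF-8 buffer to Latin-1.
 * dst is assumed to have room for at least len bytes (the output
 * never grows). Returns the number of Latin-1 bytes written.
 * Assumption: anything malformed or outside U+0000..U+00FF
 * is replaced by a single '?'. */
size_t utf8_to_latin1(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = src[i];

        if (b < 0x80) {
            /* ASCII: identical in UTF-8 and Latin-1 */
            dst[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: 110000xx 10yyyyyy -> xxyyyyyy */
            dst[n++] = (unsigned char)(((b & 0x03) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            /* No Latin-1 equivalent (or malformed input): emit one '?'
             * and skip this byte plus any continuation bytes after it */
            dst[n++] = '?';
            i += 1;
            while (i < len && (src[i] & 0xC0) == 0x80)
                i += 1;
        }
    }
    return n;
}
```

With this sketch, the input bytes C3 B6 produce the single Latin-1 byte F6, while a three-byte sequence such as E2 82 AC (the euro sign, which Latin-1 lacks) produces one '?'.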

See the excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a better understanding of character encodings and transformations between them.

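/* Encode one Latin-1 character (0x00 -- 0xff) as UTF-8.
** Writes 1 or 2 bytes to dst and returns the number of bytes produced. */
unsigned cha_latin2utf8(unsigned char *dst, unsigned cha)
{
    if (cha < 0x80) { *dst = cha; return 1; } /* ASCII: one byte, unchanged */

    /* all 11 bit codepoints (0x0 -- 0x7ff)
    ** fit within a 2byte utf8 char
    ** firstbyte = 110 +xxxxx := 0xc0 + (char>>6) MSB
    ** second    = 10 +xxxxxx := 0x80 + (char& 63) LSB
    */
    *dst++ = 0xc0 | ((cha >> 6) & 0x1f); /* 3 marker bits + 5 payload bits */
    *dst++ = 0x80 | (cha & 0x3f);        /* 2 marker bits + 6 payload bits */

    return 2; /* number of bytes produced */
}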

To test it:

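#include <stdio.h>

/* cha_latin2utf8() as defined above must precede main() in the same file */
int main (void)
{
    unsigned char buff[12]; /* unsigned, to match the function's dst parameter */

    cha_latin2utf8(buff, 0xf6);

    fprintf(stdout, "%02x %02x\n"
        , (unsigned) buff[0] & 0xff
        , (unsigned) buff[1] & 0xff );

    return 0;
}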

The result:

c3 b6