How does a file with Chinese characters know how many bytes to use per character?

误落风尘 2020-12-13 05:05

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don't …

9 Answers
  • 2020-12-13 05:14

    If the encoding is UTF-8, then the following table shows how a Unicode code point (up to 21 bits) is converted into UTF-8 encoding:

    Scalar Value                 1st Byte  2nd Byte  3rd Byte  4th Byte
    00000000 0xxxxxxx            0xxxxxxx
    00000yyy yyxxxxxx            110yyyyy  10xxxxxx
    zzzzyyyy yyxxxxxx            1110zzzz  10yyyyyy  10xxxxxx
    000uuuuu zzzzyyyy  yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
    

    There are a number of non-allowed values - in particular, bytes 0xC0, 0xC1, and 0xF5 - 0xFF can never appear in well-formed UTF-8. There are also a number of other verboten combinations. The irregularities are in the 1st byte and 2nd byte columns. Note that the codes U+D800 - U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8.

    Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF    80..BF
    U+0800..U+0FFF       E0        A0..BF    80..BF
    U+1000..U+CFFF       E1..EC    80..BF    80..BF
    U+D000..U+D7FF       ED        80..9F    80..BF
    U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
    U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
    

    These tables are lifted from the Unicode standard version 5.1.
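
    The second table above can be turned directly into a validity check. Here's a small sketch in C (the function name and interface are my own, not from the standard) that returns the length of a well-formed sequence, or -1, using exactly those byte ranges:

```c
#include <stddef.h>

/* Sketch: validate one UTF-8 sequence against the byte ranges in the
 * table above. Returns the sequence length (1-4), or -1 if ill-formed.
 * The name utf8_seq_len is illustrative, not from the standard. */
static int utf8_seq_len(const unsigned char *s, size_t n) {
    if (n == 0) return -1;
    unsigned char b0 = s[0];
    int len, lo = 0x80, hi = 0xBF;         /* default trailing-byte range */

    if (b0 <= 0x7F) return 1;              /* U+0000..U+007F              */
    if (b0 <  0xC2) return -1;             /* 80..BF, C0, C1: never first */
    if (b0 <= 0xDF) len = 2;               /* C2..DF                      */
    else if (b0 <= 0xEF) {                 /* E0..EF                      */
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;         /* reject overlong forms       */
        if (b0 == 0xED) hi = 0x9F;         /* reject UTF-16 surrogates    */
    } else if (b0 <= 0xF4) {               /* F0..F4                      */
        len = 4;
        if (b0 == 0xF0) lo = 0x90;         /* reject overlong forms       */
        if (b0 == 0xF4) hi = 0x8F;         /* cap at U+10FFFF             */
    } else return -1;                      /* F5..FF: never valid         */

    if ((size_t)len > n) return -1;        /* truncated sequence          */
    if (s[1] < lo || s[1] > hi) return -1; /* constrained 2nd byte        */
    for (int i = 2; i < len; i++)          /* remaining bytes: 80..BF     */
        if (s[i] < 0x80 || s[i] > 0xBF) return -1;
    return len;
}
```

    Note how the constrained 2nd-byte ranges for 0xE0, 0xED, 0xF0, and 0xF4 are what rule out overlong forms, surrogates, and code points above U+10FFFF.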


    In the question, the material from offset 0x0010 .. 0x008F yields:

    0x61           = U+0061
    0x61           = U+0061
    0x61           = U+0061
    0xE6 0xBE 0xB3 = U+6FB3
    0xE5 0xA4 0xA7 = U+5927
    0xE5 0x88 0xA9 = U+5229
    0xE4 0xBA 0x9A = U+4E9A
    0xE4 0xB8 0xAD = U+4E2D
    0xE6 0x96 0x87 = U+6587
    0xE8 0xAE 0xBA = U+8BBA
    0xE5 0x9D 0x9B = U+575B
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE6 0xB4 0xB2 = U+6D32
    0xE8 0xAE 0xBA = U+8BBA
    0xE5 0x9D 0x9B = U+575B
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE6 0xB4 0xB2 = U+6D32
    0xE6 0x96 0xB0 = U+65B0
    0xE9 0x97 0xBB = U+95FB
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE6 0xB4 0xB2 = U+6D32
    0xE4 0xB8 0xAD = U+4E2D
    0xE6 0x96 0x87 = U+6587
    0xE7 0xBD 0x91 = U+7F51
    0xE7 0xAB 0x99 = U+7AD9
    0x2C           = U+002C
    0xE6 0xBE 0xB3 = U+6FB3
    0xE5 0xA4 0xA7 = U+5927
    0xE5 0x88 0xA9 = U+5229
    0xE4 0xBA 0x9A = U+4E9A
    0xE6 0x9C 0x80 = U+6700
    0xE5 0xA4 0xA7 = U+5927
    0xE7 0x9A 0x84 = U+7684
    0xE5 0x8D 0x8E = U+534E
    0x2D           = U+002D
    0x29           = U+0029
    0xE5 0xA5 0xA5 = U+5965
    0xE5 0xB0 0xBA = U+5C3A
    0xE7 0xBD 0x91 = U+7F51
    0x26           = U+0026
    0x6C           = U+006C
    0x74           = U+0074
    0x3B           = U+003B
    
  • 2020-12-13 05:18

    The hint is in this sentence here:

    In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

    Every code point up to 127 has the top bit set to zero. Therefore, the editor knows that if it encounters a byte where the top bit is a 1, it is the start of a multi-byte character.
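
    That top-bit rule can be written as a tiny classifier. A sketch in C (the function name is mine, purely for illustration):

```c
/* Sketch of the rule above: the top bit distinguishes ASCII from
 * multi-byte data, and the top two bits separate lead bytes from
 * continuation bytes. The name classify is mine. */
static const char *classify(unsigned char b) {
    if ((b & 0x80) == 0x00) return "single-byte (ASCII)";
    if ((b & 0xC0) == 0x80) return "continuation byte";
    return "start of multi-byte sequence";
}
```

    For the bytes 0xE6 0xBE 0xB3 from the question, the first is a start byte and the other two are continuations.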

  • 2020-12-13 05:20

    Essentially, if it begins with a 0, it's a 7-bit code point. If it begins with 10, it's a continuation byte of a multi-byte code point. Otherwise, the number of leading 1's tells you how many bytes the code point is encoded in.

    The first byte indicates how many bytes encode the code point:

    0xxxxxxx                             7 bits of code point, encoded in 1 byte

    110xxxxx 10xxxxxx                    11 bits of code point, encoded in 2 bytes

    1110xxxx 10xxxxxx 10xxxxxx           16 bits of code point, encoded in 3 bytes

    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  21 bits of code point, encoded in 4 bytes
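
    As a sketch in C, those leading-bit patterns map to a sequence length like this (the helper name is mine):

```c
/* Sketch: read the count of leading 1 bits off the first byte to get
 * the sequence length. The name utf8_len_from_lead is mine. */
static int utf8_len_from_lead(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return -1;                          /* 10xxxxxx continuation, or invalid */
}
```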

  • 2020-12-13 05:22

    That's all part of the UTF-8 encoding (which is only one encoding scheme for Unicode).

    The size can be figured out by examining the first byte as follows:

    • if it starts with bit pattern "10" (0x80-0xbf), it's not the first byte of a sequence; back up until you find a byte that starts with "0" or "11" - that's the start of the character (thanks to Jeffrey Hantin for pointing that out in the comments).
    • if it starts with bit pattern "0" (0x00-0x7f), it's 1 byte.
    • if it starts with bit pattern "110" (0xc0-0xdf), it's 2 bytes.
    • if it starts with bit pattern "1110" (0xe0-0xef), it's 3 bytes.
    • if it starts with bit pattern "11110" (0xf0-0xf7), it's 4 bytes.

    I'll duplicate the table showing this, but the original is on the Wikipedia UTF-8 page.

    +----------------+----------+----------+----------+----------+
    | Unicode        | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
    +----------------+----------+----------+----------+----------+
    | U+0000-007F    | 0xxxxxxx |          |          |          |
    | U+0080-07FF    | 110yyyxx | 10xxxxxx |          |          |
    | U+0800-FFFF    | 1110yyyy | 10yyyyxx | 10xxxxxx |          |
    | U+10000-10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
    +----------------+----------+----------+----------+----------+
    

    The Unicode characters in the above table are constructed from the bits:

    000z-zzzz yyyy-yyyy xxxx-xxxx
    

    where the z and y bits are assumed to be zero where they're not given. Some bytes are considered illegal as a start byte since they're either:

    • useless: a 2-byte sequence starting with 0xc0 or 0xc1 would give a code point less than 0x80, which must instead be represented by a 1-byte sequence (such "overlong" encodings are ill-formed).
    • disallowed by RFC 3629: bytes 0xf5 through 0xfd, which would start 4-byte sequences above U+10FFFF or 5-byte and 6-byte sequences.
    • just unused: bytes 0xfe and 0xff.

    In addition, subsequent bytes in a multi-byte sequence that don't begin with the bits "10" are also illegal.

    As an example, consider the sequence [0xf4,0x8a,0xaf,0x8d]. This is a 4-byte sequence as the first byte falls between 0xf0 and 0xf7.

        0xf4     0x8a     0xaf     0x8d
    = 11110100 10001010 10101111 10001101
           zzz   zzyyyy   yyyyxx   xxxxxx
    
    = 1 0000 1010 1011 1100 1101
      z zzzz yyyy yyyy xxxx xxxx
    
    = U+10ABCD
    

    For your specific query with the first byte 0xe6 (length = 3), the byte sequence is:

        0xe6     0xbe     0xb3
    = 11100110 10111110 10110011
          yyyy   yyyyxx   xxxxxx
    
    = 01101111 10110011
      yyyyyyyy xxxxxxxx
    
    = U+6FB3
    

    If you look that code up here, you'll see it's the one you had in your question: 澳.

    To show how the decoding works, I went back to my archives to find my UTF8 handling code. I've had to morph it a bit to make it a complete program and the encoding has been removed (since the question was really about decoding), so I hope I haven't introduced any errors from the cut and paste:

    #include <stdio.h>
    #include <string.h>
    
    #define UTF8ERR_TOOSHORT -1
    #define UTF8ERR_BADSTART -2
    #define UTF8ERR_BADSUBSQ -3
    typedef unsigned char uchar;
    
    static int getUtf8 (uchar *pBytes, int *pLen) {
        if (*pLen < 1) return UTF8ERR_TOOSHORT;
    
        /* 1-byte sequence */
        if (pBytes[0] <= 0x7f) {
            *pLen = 1;
            return pBytes[0];
        }
    
        /* Subsequent byte marker */
        if (pBytes[0] <= 0xbf) return UTF8ERR_BADSTART;
    
        /* 2-byte sequence */
        if ((pBytes[0] == 0xc0) || (pBytes[0] == 0xc1)) return UTF8ERR_BADSTART;
        if (pBytes[0] <= 0xdf) {
            if (*pLen < 2) return UTF8ERR_TOOSHORT;
            if ((pBytes[1] & 0xc0) != 0x80) return UTF8ERR_BADSUBSQ;
            *pLen = 2;
            return ((int)(pBytes[0] & 0x1f) << 6)
                | (pBytes[1] & 0x3f);
        }
    
        /* 3-byte sequence */
        if (pBytes[0] <= 0xef) {
            if (*pLen < 3) return UTF8ERR_TOOSHORT;
            if ((pBytes[1] & 0xc0) != 0x80) return UTF8ERR_BADSUBSQ;
            if ((pBytes[2] & 0xc0) != 0x80) return UTF8ERR_BADSUBSQ;
            *pLen = 3;
            return ((int)(pBytes[0] & 0x0f) << 12)
                | ((int)(pBytes[1] & 0x3f) << 6)
                | (pBytes[2] & 0x3f);
        }
    
        /* 4-byte sequence */
        if (pBytes[0] <= 0xf4) {
            if (*pLen < 4) return UTF8ERR_TOOSHORT;
            if ((pBytes[1] & 0xc0) != 0x80) return UTF8ERR_BADSUBSQ;
            if ((pBytes[2] & 0xc0) != 0x80) return UTF8ERR_BADSUBSQ;
            if ((pBytes[3] & 0xc0) != 0x80) return UTF8ERR_BADSUBSQ;
            *pLen = 4;
            return ((int)(pBytes[0] & 0x07) << 18)  /* 11110xxx: 3 payload bits */
                | ((int)(pBytes[1] & 0x3f) << 12)
                | ((int)(pBytes[2] & 0x3f) << 6)
                | (pBytes[3] & 0x3f);
        }
    
        return UTF8ERR_BADSTART;
    }
    
    static uchar htoc (char *h) {
        uchar u = 0;
        while (*h != '\0') {
            if ((*h >= '0') && (*h <= '9'))
                u = ((u & 0x0f) << 4) + *h - '0';
            else
                if ((*h >= 'a') && (*h <= 'f'))
                    u = ((u & 0x0f) << 4) + *h + 10 - 'a';
                else
                    return 0;
            h++;
        }
        return u;
    }
    
    int main (int argCount, char *argVar[]) {
        int i;
        uchar utf8[4];
        int len = argCount - 1;
    
        if (len != 4) {
                printf ("Usage: utf8 <hex1> <hex2> <hex3> <hex4>\n");
                return 1;
        }
        printf ("Input:      (%d) %s %s %s %s\n",
            len, argVar[1], argVar[2], argVar[3], argVar[4]);
    
        for (i = 0; i < 4; i++)
                utf8[i] = htoc (argVar[i+1]);
    
        printf ("   Becomes: (%d) %02x %02x %02x %02x\n",
            len, utf8[0], utf8[1], utf8[2], utf8[3]);
    
        if ((i = getUtf8 (&(utf8[0]), &len)) < 0)
            printf ("Error %d\n", i);
        else
            printf ("   Finally: U+%x, with length of %d\n", i, len);
    
        return 0;
    }
    

    You can run it with your sequence of bytes (you'll need 4 so use 0 to pad them out) as follows:

    > utf8 f4 8a af 8d
    Input:      (4) f4 8a af 8d
       Becomes: (4) f4 8a af 8d
       Finally: U+10abcd, with length of 4
    
    > utf8 e6 be b3 0
    Input:      (4) e6 be b3 0
       Becomes: (4) e6 be b3 00
       Finally: U+6fb3, with length of 3
    
    > utf8 41 0 0 0
    Input:      (4) 41 0 0 0
       Becomes: (4) 41 00 00 00
       Finally: U+41, with length of 1
    
    > utf8 87 0 0 0
    Input:      (4) 87 0 0 0
       Becomes: (4) 87 00 00 00
    Error -2
    
    > utf8 f4 8a af ff
    Input:      (4) f4 8a af ff
       Becomes: (4) f4 8a af ff
    Error -3
    
    > utf8 c4 80 0 0
    Input:      (4) c4 80 0 0
       Becomes: (4) c4 80 00 00
       Finally: U+100, with length of 2
    
  • 2020-12-13 05:26

    UTF-8 is constructed in such a way that there is no possible ambiguity about where a character starts and how many bytes it has.

    It's really simple.

    • A byte in the range 0x80 to 0xBF is never the first byte of a character.
    • Any other byte is always the first byte of a character.

    UTF-8 has a lot of redundancy.

    If you want to tell how many bytes long a character is, there are multiple ways to tell.

    • The first byte always tells you how many bytes long the character is:
      • If the first byte is 0x00 to 0x7F, it's one byte.
      • 0xC2 to 0xDF means it's two bytes.
      • 0xE0 to 0xEF means it's three bytes.
      • 0xF0 to 0xF4 means it's four bytes.
    • Or, you can just count the number of consecutive bytes in the range 0x80 to 0xBF, because these bytes all belong to the same character as the previous byte.

    Some bytes are never used, like 0xC0, 0xC1, and 0xF5 to 0xFF, so if you encounter these bytes anywhere, then you are not looking at UTF-8.
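
    The second method above - counting the bytes that fall outside the continuation range - can be sketched like this (assuming the buffer is already valid UTF-8; the helper name is mine):

```c
#include <stddef.h>

/* Sketch of the counting method above: every byte outside 0x80-0xBF
 * starts a new character, so counting those bytes counts characters.
 * Assumes valid UTF-8 input; the name utf8_char_count is mine. */
static size_t utf8_char_count(const unsigned char *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    return count;
}
```

    This is why UTF-8 can be scanned backwards or resynchronized mid-stream: continuation bytes are self-identifying.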

  • 2020-12-13 05:26

    Why are there so many complicated answers?

    It takes 3 bytes for 1 Chinese character. You can count the bytes with this function (using jQuery):

    function get_length(field_selector) {
      // encodeURI escapes each non-ASCII byte as a three-character "%XX"
      // sequence, so counting the "%" signs counts the escaped bytes.
      var escapedStr = encodeURI($(field_selector).val());
      var count = escapedStr.split("%").length - 1;    // escaped bytes
      return count + (escapedStr.length - count * 3);  // plus unescaped characters
    }
    
    0 讨论(0)