I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", but I still don't understand how many bytes a Unicode character takes in UTF-8.
An excellent reference for this is Markus Kuhn's UTF-8 and Unicode FAQ.
3 bytes
http://en.wikipedia.org/wiki/UTF-8#Description
Code points up to 0x7f are stored as a single byte; up to 0x7ff as 2 bytes; up to 0xffff as 3 bytes; everything else as 4 bytes. (Technically, the UTF-8 scheme extends up to 0x1fffff, but the highest code point allowed in Unicode is 0x10ffff.)
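For illustration, here is a minimal Python sketch of that size rule (the helper name utf8_len is my own, not part of any standard library), cross-checked against Python's built-in encoder:

    def utf8_len(cp: int) -> int:
        """Number of bytes UTF-8 uses to store the code point cp."""
        if cp <= 0x7f:
            return 1   # ASCII range: a single byte
        if cp <= 0x7ff:
            return 2
        if cp <= 0xffff:
            return 3
        if cp <= 0x10ffff:
            return 4
        raise ValueError("beyond the Unicode range")

    # Sanity check against Python's own encoder
    for cp in (0x41, 0x7ff, 0x20ac, 0x10348):
        assert utf8_len(cp) == len(chr(cp).encode("utf-8"))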
When decoding, the first byte of a multi-byte sequence tells you how many bytes make up the sequence:
110x xxxx => 2-byte sequence
1110 xxxx => 3-byte sequence
1111 0xxx => 4-byte sequence

All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
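As a rough sketch of that decoding rule (the helper names below are hypothetical; a real decoder would also reject overlong encodings and stray continuation bytes):

    def sequence_length(first_byte: int) -> int:
        """How many bytes the UTF-8 sequence starting with first_byte has."""
        if first_byte & 0b10000000 == 0b00000000:
            return 1   # 0xxx xxxx: plain ASCII byte
        if first_byte & 0b11100000 == 0b11000000:
            return 2   # 110x xxxx
        if first_byte & 0b11110000 == 0b11100000:
            return 3   # 1110 xxxx
        if first_byte & 0b11111000 == 0b11110000:
            return 4   # 1111 0xxx
        raise ValueError("continuation byte or invalid lead byte")

    def is_continuation(b: int) -> bool:
        return b & 0b11000000 == 0b10000000   # 10xx xxxx

    # Example: the euro sign U+20AC is encoded as E2 82 AC
    data = "\u20ac".encode("utf-8")
    assert sequence_length(data[0]) == 3
    assert all(is_continuation(b) for b in data[1:])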