A in UTF-8 is U+0041 LATIN CAPITAL LETTER A. A in ASCII is 065.
How is UTF-8 backwards-compatible with ASCII?
ASCII uses only the low 7 bits of an 8-bit byte, so it covers all combinations from 00000000 to 01111111. Each of the 128 byte values in this range is mapped to a specific character.
UTF-8 keeps these exact mappings. The character represented by 01101011 in ASCII (the letter k) is represented by the same byte in UTF-8. All other characters are encoded as sequences of multiple bytes in which each byte has the highest bit set; i.e. every byte of every non-ASCII character in UTF-8 is of the form 1xxxxxxx.
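A quick way to check both claims is a short Python sketch like the one below. The sample characters are arbitrary picks for illustration, not anything special; any non-ASCII characters should behave the same way.

```python
# Every ASCII code (0..127) should encode to the same single byte
# in both ASCII and UTF-8.
ascii_matches = all(
    chr(code).encode("ascii") == chr(code).encode("utf-8") == bytes([code])
    for code in range(128)
)

# The byte for "k", written out as 8 binary digits.
k_bits = format("k".encode("utf-8")[0], "08b")

# For non-ASCII characters, every byte of the UTF-8 encoding
# should have its highest bit (0x80) set.
samples = ["\u00e9", "\u20ac", "\u65e5"]  # e-acute, euro sign, a CJK character
high_bit_always_set = all(
    byte & 0x80 for ch in samples for byte in ch.encode("utf-8")
)

print(ascii_matches, k_bits, high_bit_always_set)
```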
Unicode is backward compatible with ASCII because ASCII is a subset of Unicode. Unicode simply uses all character codes in ASCII and adds more.
Although character codes are generally written out in four-digit hexadecimal form in Unicode, as in U+0041, they are just numbers, so 0041 is the same value as hexadecimal 41, which is decimal 65.
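To see that these are just different spellings of one number, here is a small Python sketch. It uses only the built-in `ord` and the standard `unicodedata` module; the variable names are made up for the example.

```python
import unicodedata

code_point = ord("A")  # the numeric character code of "A"

# Hexadecimal 41 and decimal 65 are the same number.
same_value = code_point == 0x41 == 65

# The conventional Unicode spelling pads the hex value to four digits.
label = f"U+{code_point:04X}"
name = unicodedata.name("A")

print(same_value, label, name)
```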
UTF-8 is not a character set but an encoding used with Unicode. It happens to be compatible with ASCII too, because every byte used in a multi-byte sequence has a value of 128 or above, the range that 7-bit ASCII leaves unused.
Note that only the 7-bit ASCII character set is compatible with Unicode and UTF-8. The 8-bit character sets based on ASCII, like IBM850 and windows-1250, use the upper half of the byte range for their own characters, which is where UTF-8 has the codes for its multi-byte encodings.
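The conflict can be illustrated with one byte from the upper half of the range. The sketch below assumes Python's `cp1250` and `cp850` codecs as stand-ins for windows-1250 and IBM850, and the byte 0xE9 is an arbitrary choice.

```python
raw = bytes([0xE9])  # one byte above the 7-bit ASCII range

# Each 8-bit code page assigns its own character to this byte.
as_cp1250 = raw.decode("cp1250")
as_cp850 = raw.decode("cp850")

# In UTF-8 the same byte only starts a multi-byte sequence,
# so on its own it does not decode at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A byte in the 7-bit range decodes the same way everywhere.
ascii_same = (
    b"A".decode("ascii")
    == b"A".decode("cp1250")
    == b"A".decode("cp850")
    == b"A".decode("utf-8")
)

print(as_cp1250, as_cp850, valid_utf8, ascii_same)
```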
Why:
Because everything was already in ASCII, having a backwards-compatible Unicode format made adoption much easier. It's much easier to convert a program to use UTF-8 than to convert it to UTF-16, and the converted program inherits that backwards compatibility by still working with ASCII.
How:
ASCII is a 7-bit encoding, but in practice it is stored in 8-bit bytes. That means one bit per byte has always been unused.
UTF-8 simply uses that extra bit to signify non-ASCII characters.
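As a rough illustration of how that bit is used, the sketch below labels each byte of a UTF-8 string by looking only at its top bits. The function name `byte_role` and the sample string are hypothetical; the bit patterns (0xxxxxxx for ASCII, 10xxxxxx for continuation bytes, everything else with the high bit set starting a multi-byte sequence) follow the UTF-8 layout.

```python
def byte_role(b: int) -> str:
    """Classify one byte of UTF-8 data by its leading bits."""
    if b & 0b1000_0000 == 0:
        return "ascii"          # 0xxxxxxx: a plain ASCII character
    if b & 0b1100_0000 == 0b1000_0000:
        return "continuation"   # 10xxxxxx: middle or end of a sequence
    return "lead"               # 11xxxxxx: first byte of a sequence


# "a" is one byte, e-acute is two bytes, the euro sign is three bytes.
data = "a\u00e9\u20ac".encode("utf-8")
roles = [byte_role(b) for b in data]
print(roles)
```

A decoder that only understands ASCII will read every "ascii" byte correctly and can tell that the others are not ASCII, which is what makes the scheme backwards compatible.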