What is the difference between utf8mb4 and utf8 charsets in MySQL?
I already know about ASCII, UTF-8, UTF-16
MySQL added the utf8mb4 character set in version 5.5.3. The "mb4" stands for "most bytes 4", meaning at most four bytes per character; it was designed specifically to be compatible with four-byte Unicode characters. Fortunately, utf8mb4 is a superset of utf8, so apart from changing the character set to utf8mb4, no other conversion is needed. Of course, if saving space matters and you only need characters that utf8 can represent, utf8 is generally enough.
The original UTF-8 format used one to six bytes and could encode up to 31 bits. The latest UTF-8 specification uses only one to four bytes and can encode up to 21 bits, which is just enough to represent all 17 Unicode planes. In MySQL, utf8 is a character set that supports only UTF-8 characters of at most three bytes, which corresponds to the Basic Multilingual Plane in Unicode.
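To make the byte counts above concrete, here is a minimal Python sketch of how the current UTF-8 specification maps a code point to an encoded length. The function name `utf8_length` is just an illustration, not part of any library; the result is checked against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes modern (one-to-four-byte) UTF-8 needs for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes, which need utf8mb4 in MySQL


# Cross-check against Python's own UTF-8 encoder.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

Only the last case (U+1F600, an emoji) needs four bytes, which is exactly the case MySQL's three-byte utf8 cannot hold.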
To store four-byte UTF-8 characters in MySQL, you need the utf8mb4 character set, which is supported only in version 5.5.3 and later (check your version with SELECT VERSION();). I think that for better compatibility you should always use utf8mb4 instead of utf8. For CHAR columns, utf8mb4 consumes more space, so, in line with MySQL's official recommendation, use VARCHAR instead of CHAR.
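The space cost for CHAR columns is simple arithmetic: a fixed-length column has to allow for the longest possible encoding of every character. A rough Python sketch follows, assuming worst-case reservation; the helper name `char_reserved_bytes` is made up for illustration, and actual on-disk usage depends on the storage engine and row format.

```python
# Maximum bytes per character for each MySQL character set name.
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def char_reserved_bytes(length: int, charset: str) -> int:
    """Worst-case bytes needed for a CHAR(length) column in the given charset."""
    return length * MAX_BYTES_PER_CHAR[charset]


print(char_reserved_bytes(10, "utf8mb3"))  # 30
print(char_reserved_bytes(10, "utf8mb4"))  # 40
```

A VARCHAR column stores only the bytes each value actually needs (plus a length prefix), which is why it is the usual suggestion here.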
In MariaDB, utf8mb4 is the default CHARSET when it is not set explicitly in the server config, hence COLLATE utf8mb4_unicode_ci is used.
Refer to the MariaDB documentation on CHARSET & COLLATE.
CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
In MySQL, utf8 refers to a flawed implementation of the UTF-8 standard in which not all character ranges are supported.
Specifically, only characters in the basic multilingual plane work, with other characters considered invalid. This is because the values within that plane - 0 to 65535 (some of which are reserved for special reasons) can be represented by multi-byte encodings in UTF-8 of up to 3 bytes, and MySQL's take on UTF-8 arbitrarily decided to set that as a limit.
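If you want to know whether some text would hit that limit, you can look for code points above 65535 (0xFFFF) before it reaches the database. A small Python sketch; the function name `supplementary_chars` is hypothetical, and this checks the text itself rather than querying MySQL.

```python
def supplementary_chars(text: str) -> list:
    """Return the characters in text that lie outside the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "caf\u00e9 \U0001F600"  # "café" followed by an emoji
offenders = supplementary_chars(sample)
print([f"U+{ord(ch):X}" for ch in offenders])  # ['U+1F600']
```

An empty result means every character encodes in at most three bytes and fits MySQL's old utf8; anything listed needs utf8mb4.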
Back when MySQL released this, that wasn't much of a problem. Since then, more and more newly defined character ranges have been added to Unicode with values outside the basic multilingual plane.
In an effort not to break old code making any particular assumptions, MySQL retained the broken implementation and called the newer, fixed version utf8mb4. This has led to some confusion, with the name being misinterpreted as if it's some kind of extension to UTF-8, rather than MySQL's official true implementation of UTF-8.
Future versions of MySQL may eventually phase out the older version, but for the foreseeable future utf8mb4 is to be used instead to ensure correct UTF-8 encoding.
Some may take issue with my describing the older, non-compliant implementation as flawed or broken. But it is true that by only allowing multi-byte encodings of up to 3 bytes, it never correctly followed the UTF-8 standard as it existed at any point in time, and that is the reason for its flaws. At no point was UTF-8 defined as supporting up to 3 bytes: the only time it was not defined as being up to 4 bytes was when it was originally defined as being up to 6 bytes (!!), which subsequent Unicode specs decided was overkill.
Taken from the MySQL 8.0 Reference Manual:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
In MySQL, utf8 is currently an alias for utf8mb3, which is deprecated and will be removed in a future MySQL release. At that point utf8 will become a reference to utf8mb4.
So regardless of this alias, you can deliberately set the utf8mb4 encoding yourself.
To complete the answer, I'd like to add @WilliamEntriken's comment below (also taken from the manual):
To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
UTF-8 is a variable-length encoding. In the case of UTF-8, this means that storing one code point requires one to four bytes. However, MySQL's encoding called "utf8" (alias of "utf8mb3") only stores a maximum of three bytes per code point.
So the character set "utf8"/"utf8mb3" cannot store all Unicode code points: it only supports the range 0x0000 to 0xFFFF, which is called the "Basic Multilingual Plane". See also Comparison of Unicode encodings.
This is what (a previous version of the same page at) the MySQL documentation has to say about it:
The character set named utf8[/utf8mb3] uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character [and] supports supplemental characters:
For a BMP character, utf8[/utf8mb3] and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8[/utf8mb3] cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8[/utf8mb3] cannot store the character at all, you do not have any supplementary characters in utf8[/utf8mb3] columns and you need not worry about converting characters or losing data when upgrading utf8[/utf8mb3] data from older versions of MySQL.
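The two quoted points can be mimicked outside the database. The Python sketch below only models the documented behaviour and is an assumption-laden illustration: `stored_bytes` is a made-up helper, and `None` stands in for "cannot be stored" (what MySQL actually does on such an insert, an error or a truncation with a warning, depends on the SQL mode).

```python
from typing import Optional


def stored_bytes(text: str, charset: str) -> Optional[int]:
    """Bytes needed to hold text in the given charset, or None if it cannot be stored."""
    if charset not in ("utf8mb3", "utf8mb4"):
        raise ValueError("expected 'utf8mb3' or 'utf8mb4'")
    # utf8mb3 has no representation for code points beyond the BMP.
    if charset == "utf8mb3" and any(ord(ch) > 0xFFFF for ch in text):
        return None
    return len(text.encode("utf-8"))


# BMP character: identical length in both character sets.
print(stored_bytes("\u20ac", "utf8mb3"), stored_bytes("\u20ac", "utf8mb4"))  # 3 3
# Supplementary character: only utf8mb4 can hold it, using four bytes.
print(stored_bytes("\U0001F600", "utf8mb3"), stored_bytes("\U0001F600", "utf8mb4"))  # None 4
```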
So if you want your column to support storing characters lying outside the BMP (and you usually want to), such as emoji, use "utf8mb4". See also What are the most common non-BMP Unicode characters in actual use?.
The utf8mb4 character set is useful because nowadays we need support for storing not only language characters but also symbols, newly introduced emojis, and so on.
A nice read on How to support full Unicode in MySQL databases by Mathias Bynens can also shed some light on this.