SQL doesn't differentiate u and ü although collation is utf8mb4_unicode_ci

Posted by 核能气质少年 on 2019-11-29 10:05:49

Collation and character set are two different things.

A character set is just an 'unordered' list of characters and their representation. utf8mb4 is a character set and covers a lot of characters.

A collation defines the order of characters (it determines the end result of ORDER BY, for example) and defines other rules (such as which characters or character combinations should be treated as the same). Collations are derived from character sets, and there can be more than one collation for the same character set. (It is an extension to the character set - sorta)

In utf8mb4_unicode_ci, most accented characters are treated as equal to their unaccented base character, which is why a search for u also returns ü. In short, this collation is accent insensitive.

This is similar to the fact that German collations treat ss and ß as the same.

utf8mb4_bin is another collation and it treats all characters as different ones. You may or may not want to use it as default, this is up to you and your business rules.
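A quick way to see the difference is to compare two literals under each collation. This is only a minimal sketch, assuming a MySQL session whose connection character set is utf8mb4 (for example after SET NAMES utf8mb4); the column aliases are made up for illustration.

```sql
-- Assumes the connection character set is utf8mb4 (e.g. after SET NAMES utf8mb4).
SELECT
  _utf8mb4'u' COLLATE utf8mb4_unicode_ci = _utf8mb4'ü' AS equal_under_unicode_ci, -- 1: accent is ignored
  _utf8mb4'u' COLLATE utf8mb4_bin        = _utf8mb4'ü' AS equal_under_bin;        -- 0: code points differ
```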

You can also convert the collation in queries, but be aware that doing so will prevent MySQL from using indexes on that column.
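Converting the collation inside a query could look like the sketch below. The table `users` and its indexed column `name` are hypothetical, and the column is assumed to be stored as utf8mb4_unicode_ci.

```sql
-- Hypothetical table `users` with an indexed utf8mb4_unicode_ci column `name`.
-- The COLLATE clause makes this one comparison accent- and case-sensitive,
-- but an index built for utf8mb4_unicode_ci generally cannot serve the lookup.
SELECT *
FROM users
WHERE name COLLATE utf8mb4_bin = 'ü';
```

Prefixing the statement with EXPLAIN is a simple way to check whether an index is still used on your server version.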

Here is an example using a similar, but maybe a bit more familiar part of collations:

The ci at the end of a collation name means Case Insensitive. Some collations have a counterpart ending in cs, meaning Case Sensitive, but not all of them do: for utf8mb4, MySQL offers case-sensitive variants such as utf8mb4_0900_as_cs only from 8.0 on, and before that utf8mb4_bin is the usual choice.

When your column is case insensitive, the where condition column = 'foo' will find all of these: foo, Foo, fOo, foO, FOo, FoO, fOO, FOO.

Now if you set the collation to a case-sensitive one (utf8mb4_0900_as_cs on MySQL 8.0, or utf8mb4_bin on older versions), all the above values are treated as different values.
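The same kind of literal comparison illustrates case sensitivity. Again a sketch under assumptions: the connection character set is utf8mb4, the aliases are illustrative, and the second statement only works on MySQL 8.0 or later, where utf8mb4_0900_as_cs exists.

```sql
-- Assumes the connection character set is utf8mb4.
SELECT
  _utf8mb4'foo' COLLATE utf8mb4_unicode_ci = _utf8mb4'FOO' AS ci_match,  -- 1: case is ignored
  _utf8mb4'foo' COLLATE utf8mb4_bin        = _utf8mb4'FOO' AS bin_match; -- 0: code points differ

-- MySQL 8.0+ only: an accent- and case-sensitive, non-binary collation.
SELECT _utf8mb4'foo' COLLATE utf8mb4_0900_as_cs = _utf8mb4'FOO' AS cs_match; -- 0
```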

The localized collations (like German, UK, US, Hungarian, whatever) follow the rules of the named language. In German, ss and ß are treated as the same, as laid down by the rules of the language. When a German user searches for the value Straße, they will expect software (supporting the German language or written in Germany) to return both Straße and Strasse.

To go further, when it comes to ordering, the two words are the same: they are equal and their meaning is the same, so there is no particular order between them.

Don't forget that the UNIQUE constraint is just a way of ordering/filtering values. So if there is a unique key defined on a column with a German collation, it will not allow inserting both Straße and Strasse, since by the rules of the language they should be treated as equal.
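The effect on a unique key can be sketched as follows. The table and key names are hypothetical, and utf8mb4_german2_ci (German phone-book rules, where ß = ss) is just one collation that shows this behavior.

```sql
-- Hypothetical table; utf8mb4_german2_ci treats ß as equal to ss.
CREATE TABLE street_names (
  name VARCHAR(100) NOT NULL,
  UNIQUE KEY uq_name (name)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_german2_ci;

INSERT INTO street_names (name) VALUES ('Straße');

-- Expected to fail with a duplicate-key error (1062),
-- because 'Strasse' compares as equal to 'Straße' under this collation.
INSERT INTO street_names (name) VALUES ('Strasse');
```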

Now let's see our original collation, utf8mb4_unicode_ci. This is a 'universal' collation, which means that it tries to simplify everything: it compares base letters only and ignores accent differences, so ü is made equal to u (handy for the many users who have no idea how to type ü). This is a simplification in order to support most languages, but as you already know, these kinds of simplifications have some side effects (in ordering, filtering, unique constraints, etc.).

utf8mb4_bin is the other end of the spectrum. This collation is designed to be as strict as it can be. To achieve this, it literally uses the character codes to distinguish characters. This means that each and every form of a character is different: this collation is implicitly case sensitive and accent sensitive.

Both of these have drawbacks. The localized and general collations are designed either for one specific language or to provide a common solution, so their rules will not fit every use case. (utf8mb4_unicode_ci can be thought of as a more thorough take on what the old utf8_general_ci collation tried to do.)

The binary collation requires extra caution when it comes to user interaction. Since it is CS and AS, it can confuse users who are used to getting the value 'Foo' when they are looking for the value 'foo'. Also, as a developer, you have to be extra cautious when it comes to joins and other features. An INNER JOIN on 'foo' = 'Foo' will return nothing, since 'foo' is not equal to 'Foo'.

I hope that these examples and explanations help a bit.
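To make the join pitfall concrete, here is a sketch with two hypothetical tables, `orders` and `customers`, whose `customer_code` columns are both assumed to use utf8mb4_bin.

```sql
-- Hypothetical tables; both customer_code columns are assumed to be utf8mb4_bin.
-- Rows match only when the codes are identical character for character,
-- so an order with code 'foo' does not join to a customer with code 'Foo'.
SELECT o.id, c.name
FROM orders AS o
INNER JOIN customers AS c
  ON o.customer_code = c.customer_code;

-- Forcing a case-insensitive comparison for this one query
-- (at the cost of index use on the converted column):
SELECT o.id, c.name
FROM orders AS o
INNER JOIN customers AS c
  ON o.customer_code COLLATE utf8mb4_unicode_ci = c.customer_code;
```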

utf8_collations.html lists what letters are 'equal' in the various utf8 (or utf8mb4) collations. With rare exceptions, all accents are stripped before comparing in any ..._ci collation. Some of the exceptions are language-specific, not Unicode in general. Example: In Icelandic É > E.

Through MySQL 5.7, ..._bin is the only utf8/utf8mb4 collation that treats accented letters as different. Ditto for case folding. (8.0 adds accent- and case-sensitive collations such as utf8mb4_0900_as_cs.)

If you are doing a lot of comparing, you should change the collation of the column to ..._bin. When using the COLLATE clause in WHERE, an index cannot be used.
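Changing the column's collation could look like the following sketch. The table and column are hypothetical; when using MODIFY, repeat the column's existing type, length and NULL-ability, since the whole definition is replaced.

```sql
-- Hypothetical table and column; adjust the type and length to match your schema.
ALTER TABLE users
  MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
```

After such a change, a plain `WHERE name = '...'` is accent- and case-sensitive without any COLLATE clause, so an index on the column remains usable.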

A note on ß. ss = ß in virtually all collations. The notable exception is utf8_general_ci (which used to be the default), which treats them as unequal. That one collation makes no effort to treat any 2-letter combination (ss) as a single 'letter'. Also, due to a mistake in 5.0, utf8_general_mysql500_ci treats them as unequal.

Going forward, utf8mb4_unicode_520_ci is the best through version 5.7. For 8.0, utf8mb4_0900_ai_ci is 'better'. The "520" and "0900" refer to versions of the Unicode standard (5.2.0 and 9.0.0), so there may be even newer ones in the future.
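The ß behavior can be checked directly with literals. A small sketch, assuming the connection character set is utf8mb4 and using the utf8mb4 counterparts of the collations named above; the aliases are illustrative.

```sql
-- Assumes the connection character set is utf8mb4.
SELECT
  _utf8mb4'ss' COLLATE utf8mb4_unicode_ci = _utf8mb4'ß' AS unicode_ci, -- 1: ß is expanded to ss
  _utf8mb4'ss' COLLATE utf8mb4_general_ci = _utf8mb4'ß' AS general_ci; -- 0: no multi-character handling
```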
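Which of these collations are actually available depends on the server version, so it may help to look before choosing. A sketch, where the schema name `mydb` and table name `users` are placeholders:

```sql
-- List the utf8mb4 collations this server offers.
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Check which collation the columns of an existing table use (placeholder names).
SELECT COLUMN_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME = 'users';
```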

You can try the utf8_bin collation and you shouldn't face this issue, but it will be case sensitive. The bin collations compare strictly, only separating the characters out according to the encoding selected, and once that's done, comparisons are done on a binary basis, much like many programming languages would compare strings.

Sea Coast of Tibet

I'll just add to the other answers that a _bin collation such as utf8mb4_bin has its peculiarities as well: it still ignores trailing spaces when comparing.

For example, after the following:

CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');

this will fail with a duplicate-key error, because 'one' and 'one ' compare as equal:

INSERT INTO `dummy` (`key`) VALUES ('one ');

This is described in the MySQL manual section "The binary Collation Compared to _bin Collations".
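The difference between a _bin collation and true binary strings can be seen without creating a table. A minimal sketch, assuming the connection character set is utf8mb4; the aliases are illustrative.

```sql
-- Assumes the connection character set is utf8mb4.
SELECT
  _utf8mb4'one' COLLATE utf8mb4_bin = _utf8mb4'one ' AS bin_collation, -- 1: trailing space is ignored
  CAST('one' AS BINARY) = CAST('one ' AS BINARY)     AS binary_string; -- 0: bytes are compared as-is
```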

Edit: I've posted a related question here.
