MySQL returns incorrect UTF8 extended characters in some cases only

て烟熏妆下的殇ゞ 提交于 2019-12-13 01:59:47

问题


Note: In the following question you may see ? or blocks instead of characters, this is because you don't have the appropriate font. Please ignore this.

Background

I have a table with data structured as follows:

CREATE TABLE `decomposition_dup` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `parent` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 `structure` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 `child` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 PRIMARY KEY (`id`),
 KEY `parent` (`parent`),
 KEY `child` (`child`),
 KEY `parent_2` (`parent`,`child`)
) ENGINE=InnoDB AUTO_INCREMENT=211929 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

And some example data:

INSERT INTO `decomposition_dup` (`id`, `parent`, `structure`, `child`) VALUES
(154647, '锦', 'a', '钅'),
(154648, '锦', 'a', '帛'),
(185775, '钅', 'd', '二'),
(185774, '钅', 'd', '㇟'),
(21195, '钅', 'd', '𠂉'),
(21178, '⻐', 'd', '乇'),
(21177, '⻐', 'd', '𠂉');

And the charsets are all set correctly:

Problem

It is very important to note that:

  • 154647, 185775, 185774 & 21195 refer to this character: http://unicode.scarfboy.com/?s=%E9%92%85
  • 21178 and 21177 refer to this character: http://unicode.scarfboy.com/?s=%E2%BB%90

As you can see, they are different characters. However, in some cases they are treated as the same character.

Case 1

When I run the following query, it only returns the correct child (i.e. doesn't return the similar-looking but different character child):

SELECT *
FROM decomposition_dup
WHERE parent = '锦'

This is correct behaviour.

Case 2

However, when I run the following query using 钅 (http://unicode.scarfboy.com/?s=%E9%92%85) it returns both the similar characters:

SELECT *
FROM decomposition_dup
WHERE parent = '钅'

This should only return 185775, 185774 & 21195.

Case 3

And when I run the following query using ⻐ (http://unicode.scarfboy.com/?s=%E2%BB%90) it also returns both the similar characters:

SELECT *
FROM decomposition_dup
WHERE parent = '⻐'

This should only return 21178 and 21177.

Case 4

If I replace = with LIKE for the broken queries (i.e. Case 2 and Case 3), they return correctly.

For example, the following query is the same as Case 3 but usingLIKE:

SELECT *
FROM decomposition_dup
WHERE parent LIKE '⻐'

This returns the correct characters but slows down the query.

Question

Is this a bug in MySQL or is there something that I am overlooking when querying for UTF8 extended characters?


回答1:


If you want them to be the same, set the COLLATION of the columns to utf8mb4_unicode_ci or utf8mb4_unicode_520_ci.
If you want them to be different, use utf8mb4_general_ci, instead:

mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
    ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_general_ci AS general;
+---------+
| general |
+---------+
|       0 |
+---------+

mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
    ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_ci AS unicode;
+---------+
| unicode |
+---------+
|       1 |
+---------+

mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
    ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_520_ci AS unicode_520;
+-------------+
| unicode_520 |
+-------------+
|           1 |
+-------------+



回答2:


From what I can make out the problem lies within the SQL side of things upon research you'll see that this error code means that

MySQL's utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8.

so It might be the characters you are using within the cases of SQL



来源:https://stackoverflow.com/questions/35004672/mysql-returns-incorrect-utf8-extended-characters-in-some-cases-only

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!