How can I search by emoji in MySQL using utf8mb4?

混江龙づ霸主 submitted on 2019-11-30 08:28:24
t.niese

You use utf8mb4_unicode_ci for your columns, so the check is case insensitive and relies on that collation's weights, under which both emoji compare as equal. If you use utf8mb4_bin instead, then the emoji 🌮 and 🌶 are correctly identified as different characters.
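A minimal sketch of what this means for a search, assuming a utf8mb4 connection and a hypothetical table `emoji_demo` with a `label` column (neither name comes from the question):

```sql
-- Assumes the connection character set is utf8mb4 so the emoji literals survive.
SET NAMES utf8mb4;

-- Hypothetical demo table; the names are illustrative only.
CREATE TEMPORARY TABLE emoji_demo (
  id    INT AUTO_INCREMENT PRIMARY KEY,
  label VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

INSERT INTO emoji_demo (label) VALUES ('🌮'), ('🌶');

-- Uses the column collation (utf8mb4_unicode_ci): both emoji share one weight,
-- so both rows are expected to match.
SELECT id, label FROM emoji_demo WHERE label = '🌮';

-- An explicit COLLATE overrides the column collation for this comparison only,
-- so only the taco row is expected to match.
SELECT id, label FROM emoji_demo WHERE label = '🌮' COLLATE utf8mb4_bin;
```

A per-query override like this generally cannot use an index built under the column's own collation, so for frequent searches changing the column collation is likely the better choice.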

With WEIGHT_STRING you can get the values that are used for sorting and comparison for the input string.

If you write:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_ci');

Then you can see that both weights are 0xfffd. The MySQL manual section on Unicode Character Sets says:

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER.

If you write:

SELECT 
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_bin'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_bin')

You will get their Unicode code points, 0x01f32e and 0x01f336, instead.
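To check that the utf8mb4_bin weight really is the code point and not the stored bytes, the values can be put side by side. This is only a sketch, assuming a utf8mb4 connection; the column aliases are made up for readability:

```sql
-- Assumes the connection character set is utf8mb4.
SET NAMES utf8mb4;

SELECT
  HEX(WEIGHT_STRING('🌮' COLLATE utf8mb4_bin)) AS bin_weight,  -- weight used by utf8mb4_bin
  HEX(CONVERT('🌮' USING utf32))               AS code_point,  -- UTF-32 form: U+1F32E padded to 4 bytes
  HEX('🌮')                                    AS utf8_bytes;  -- UTF-8 encoding of the literal
```

The first two columns should agree apart from the leading zero byte of the UTF-32 form, while the third shows the four-byte UTF-8 encoding F09F8CAE.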

Other letters such as Ä, Á and A compare as equal under utf8mb4_unicode_ci. The reason can be seen in:

SELECT
  WEIGHT_STRING('Ä' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING('A' COLLATE 'utf8mb4_unicode_ci');

Both map to the weight 0x0E33:

Ä: 00C4  ; [.0E33.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS; QQCM
A: 0041  ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A

According to "Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?", the weights used for utf8mb4_unicode_ci are based on UCA 4.0.0. Because the emoji do not appear there, the mapped weight is 0xfffd.
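For contrast, the same manual section describes how collations based on newer UCA versions treat a character that has no explicit entry: rather than collapsing it to 0xfffd, they derive an implicit two-part weight from the code point, as aaaa = base + (code >> 15) and bbbb = (code & 0x7FFF) | 0x8000, with base 0xFBC0 outside the CJK ideograph ranges. A rough sketch of that arithmetic for the two emoji follows; the decimal literals are my own choice, made to sidestep version differences in how hexadecimal literals behave with bit operators:

```sql
-- U+1F32E (taco) = 127790, U+1F336 (hot pepper) = 127798
-- base 0xFBC0 = 64448, mask 0x7FFF = 32767, flag 0x8000 = 32768
SELECT
  HEX(64448 + (127790 >> 15))   AS taco_aaaa,    -- FBC3
  HEX((127790 & 32767) | 32768) AS taco_bbbb,    -- F32E
  HEX(64448 + (127798 >> 15))   AS pepper_aaaa,  -- FBC3
  HEX((127798 & 32767) | 32768) AS pepper_bbbb;  -- F336
```

Concatenated, the parts give FBC3F32E and FBC3F336, so the two emoji no longer share a weight under a UCA 5.2.0 collation.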

If you need case-insensitive comparison and sorting for regular letters along with emoji, this problem is solved by using utf8mb4_unicode_520_ci:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_520_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_520_ci')

With this collation you also get different weights for those emoji: 0xfbc3f32e and 0xfbc3f336.
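Applied to a real column, the change might look like the sketch below. It assumes a utf8mb4 connection and uses a hypothetical table `emoji_520_demo` with a `label` column; adapt the names and the column definition to your own schema:

```sql
-- Assumes the connection character set is utf8mb4.
SET NAMES utf8mb4;

-- Hypothetical demo table; the names are illustrative only.
CREATE TEMPORARY TABLE emoji_520_demo (
  id    INT AUTO_INCREMENT PRIMARY KEY,
  label VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

INSERT INTO emoji_520_demo (label) VALUES ('🌮'), ('🌶'), ('Ä'), ('a');

-- Switch the column to the UCA 5.2.0 collation. MODIFY replaces the whole
-- column definition, so restate the type and any other attributes.
ALTER TABLE emoji_520_demo
  MODIFY label VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;

-- The emoji now carry distinct weights, so this should match only the taco row.
SELECT id, label FROM emoji_520_demo WHERE label = '🌮';

-- Letters still compare case- and accent-insensitively, so this should
-- match both 'Ä' and 'a'.
SELECT id, label FROM emoji_520_demo WHERE label = 'A';
```

On a large table the ALTER rebuilds the column and any indexes on it, so it is worth trying on a copy first.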
