MySQL case-sensitive comparison with utf8_general_ci

I have a MySQL database that uses utf8_general_ci (which is case-insensitive), and in my tables I have some columns, such as ID, that hold case-sensitive data (example: 'iSZ6fX' or 'As

3 Answers

情书的邮戳

2021-02-13 14:33

It is better to define the column with the utf8_bin collation than to specify the collation in each query, because that reduces the chance of errors.
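Changing the column's collation might look like the following minimal sketch. The table name `items` and the `VARCHAR(16) NOT NULL` definition are hypothetical, chosen only for illustration. Note that `MODIFY` replaces the whole column definition, so the existing type and attributes have to be restated.

```sql
-- Hypothetical table and column definition; adjust to match your schema.
-- MODIFY rewrites the full column definition, so repeat the type and NOT NULL.
ALTER TABLE items
  MODIFY id VARCHAR(16) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- No per-query COLLATE clause is needed any more:
-- the comparison uses the column's utf8_bin collation and is case-sensitive.
SELECT * FROM items WHERE id = 'iSZ6fX';
```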

暗喜

2021-02-13 14:37
It is better to use the utf8_bin collation because, even though it is not possible in UTF-8, in the general case it is theoretically possible (such as happens with UTF-16) for the same string to be represented by different encodings, which a binary comparison would not understand but a binary collation would. As documented under Unicode Character Sets:
There is a difference between “ordering by the character's code value” and “ordering by the character's binary representation,” a difference that appears only with utf16_bin, because of surrogates.

Suppose that utf16_bin (the binary collation for utf16) was a binary comparison “byte by byte” rather than “character by character.” If that were so, the order of characters in utf16_bin would differ from the order in utf8_bin. For example, the following chart shows two rare characters. The first character is in the range E000-FFFF, so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.
```
Code point  Character                    utf8         utf16
----------  ---------                    ----         -----
0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D
10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84
```
The two characters in the chart are in order by code point value because 0xff9d < 0x10384. And they are in order by utf8 value because 0xef < 0xf0. But they are not in order by utf16 value, if we use byte-by-byte comparison, because 0xff > 0xd8.

So MySQL's utf16_bin collation is not “byte by byte.” It is “by code point.” When MySQL sees a supplementary-character encoding in utf16, it converts to the character's code-point value, and then compares. Therefore, utf8_bin and utf16_bin are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation: “UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11: The Unicode scalar value of a character is its code point treated as an unsigned integer.”
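The difference between the two orderings can be checked directly. This is a minimal sketch, assuming a server where the utf16 character set is available; the byte values are the utf16 encodings from the chart above, and the column aliases are only labels.

```sql
-- Comparison under utf16_bin: the surrogate pair D800 DF84 is decoded
-- to code point U+10384 before comparing, so U+FF9D sorts first.
SELECT (_utf16 X'FF9D' COLLATE utf16_bin)
     < (_utf16 X'D800DF84' COLLATE utf16_bin) AS by_code_point;

-- Byte-by-byte comparison of the same encodings: the first bytes
-- are 0xFF and 0xD8, so the order comes out reversed.
SELECT CAST(_utf16 X'FF9D' AS BINARY)
     < CAST(_utf16 X'D800DF84' AS BINARY) AS by_bytes;
```

According to the documentation quoted above, the first query should report that U+FF9D sorts before U+10384, while the second should report the opposite.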
Therefore, if comparisons involving these columns should always be case-sensitive, set the column's collation to utf8_bin, so that they stay case-sensitive even if you forget to say so in a query. If only particular queries need to be case-sensitive, you can request the utf8_bin collation for that comparison with the COLLATE keyword:
```
-- `table` is a placeholder name; it is a reserved word, so it must be quoted with backticks
SELECT * FROM `table` WHERE id = 'iSZ6fX' COLLATE utf8_bin;
```
一向

2021-02-13 14:39
The effect of BINARY as a column attribute differs from its effect prior to MySQL 4.1. Formerly, BINARY resulted in a column that was treated as a binary string. A binary string is a string of bytes that has no character set or collation, which differs from a nonbinary character string that has a binary collation.

But now:

The BINARY operator casts the string following it to a binary string. This is an easy way to force a comparison to be done byte by byte rather than character by character. BINARY also causes trailing spaces to be significant. BINARY str is shorthand for CAST(str AS BINARY).
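The effect of the operator can be seen with a few literal comparisons. This is a small sketch; the results noted in the comments assume a connection collation that is case-insensitive and pads trailing spaces, such as utf8_general_ci, and may differ under other collations.

```sql
-- Character comparison under a case-insensitive collation: equal.
SELECT 'a' = 'A';

-- Byte-by-byte comparison: 0x61 and 0x41 differ, so not equal.
SELECT BINARY 'a' = 'A';

-- Under a PAD SPACE collation the trailing space is ignored: equal.
SELECT 'a' = 'a ';

-- With BINARY the trailing space is significant: not equal.
SELECT BINARY 'a' = 'a ';
```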

The BINARY attribute in character column definitions has a different effect. A character column defined with the BINARY attribute is assigned the binary collation of the column character set. Every character set has a binary collation. For example, the binary collation for the latin1 character set is latin1_bin, so if the table default character set is latin1, these two column definitions are equivalent:
```
CHAR(10) BINARY

CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
```
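One way to confirm the equivalence is to create a table with the BINARY attribute and inspect the resulting collation. The table name `t_demo` and column name `code` below are hypothetical.

```sql
-- Hypothetical table; the default character set is latin1 as in the text above.
CREATE TABLE t_demo (
  code CHAR(10) BINARY
) DEFAULT CHARACTER SET latin1;

-- The Collation column of the output shows which collation was assigned;
-- per the documentation quoted above it should be latin1_bin.
SHOW FULL COLUMNS FROM t_demo;

-- Clean up the demonstration table.
DROP TABLE t_demo;
```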