How to detect Chinese characters in MySQL?

Question

I need to count how many values in a list of columns contain Chinese characters. For example, if "北京实业" occurs, that is four Chinese characters, but it should be counted only once, since it occurs in the column.

Is there any specific code to figure this out?

Answer 1:

SELECT COUNT(*)
    FROM tbl
    WHERE HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)'

will count the number of records with Chinese characters in column col.

Problems:

I am not sure what ranges of hex represent Chinese.
The test may include Korean and Japanese. ("CJK")
In MySQL, 4-byte Chinese characters need utf8mb4 instead of utf8.

Elaboration

I am assuming the column in the table is CHARACTER SET utf8. In utf8 encoding, Chinese characters begin with a byte between hex E2 and E9, or EF, or F0. Those starting with hex E will be 3 bytes long, but I am not checking the length; the F0 ones will be 4 bytes.
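To see where those lead bytes come from, here is a minimal Python sketch (Python is used only for illustration and is not part of the original answer). It prints the UTF-8 bytes of the first and last code points of two Unicode blocks. The block boundaries are an assumption about what counts as "Chinese"; other CJK blocks exist and are not listed here.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of one code point as uppercase hex."""
    return chr(code_point).encode("utf-8").hex().upper()

# Illustrative block boundaries (not exhaustive).
BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "CJK Unified Ideographs Extension B": (0x20000, 0x2A6DF),
}

def block_edges() -> dict:
    """Map each block name to the hex encodings of its first and last code point."""
    return {name: (utf8_hex(lo), utf8_hex(hi)) for name, (lo, hi) in BLOCKS.items()}

for name, (first, last) in block_edges().items():
    # Two hex digits per byte, so the byte length is half the string length.
    print(f"{name}: {first} .. {last} ({len(first) // 2} bytes each)")
```

The main block runs from E4B880 to E9BFBF (3 bytes), and Extension B starts at F0A08080 (4 bytes), which is consistent with the E-something and F0A prefixes used in the query.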

The regexp starts with ^(..)*, meaning "from the start of the string (^), locate 0 or more (*) 2-character (..) values". This keeps the match aligned on byte boundaries, because each byte is two hex digits. After that should be either E-something or F0A. After that, anything can occur. The E-something is, more specifically, E followed by any of 2,3,4,5,6,7,8,9, or F.

Picked at random, I see that 草 encodes as the 3 hex bytes E88D89, and 𠜎 encodes as the 4 hex bytes F0A09C8E.
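The pattern can be tried outside the database. The following is a rough Python sketch, assuming that encoding a string as UTF-8 and hex-dumping it in uppercase is a fair stand-in for HEX(col) on a utf8 or utf8mb4 column. The helper names to_hex and looks_cjk are made up for this illustration.

```python
import re

# Same pattern as in the SQL above, applied to an uppercase hex string.
PATTERN = re.compile(r"^(..)*(E[2-9F]|F0A)")

def to_hex(text: str) -> str:
    """Rough stand-in for MySQL HEX() on a utf8/utf8mb4 column."""
    return text.encode("utf-8").hex().upper()

def looks_cjk(text: str) -> bool:
    """True if some byte-aligned hex pair starts like a CJK lead byte."""
    return PATTERN.search(to_hex(text)) is not None

for sample in ["草", "𠜎", "Beijing 北京实业", "plain ASCII", "n "]:
    print(repr(sample), to_hex(sample), looks_cjk(sample))
```

The last sample shows why the ^(..)* prefix matters: "n " hex-dumps to 6E20, which contains the digits E2 at an odd offset. Without the alignment prefix, that would be reported as a match.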

I do not know of a better way to check a string for a specific language.

As you found, the REGEXP can be rather slow.

This regexp could be overkill, in that some non-Chinese characters may be captured.
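A short Python sketch makes the over-matching concrete and compares it with a stricter check on code points. The ranges in HAN_RANGES are an assumption about what "Chinese" means here (CJK Unified Ideographs plus Extensions A and B). Since Japanese kanji share those code points, this check still cannot tell the two languages apart. All function names are hypothetical.

```python
import re

HEX_PATTERN = re.compile(r"^(..)*(E[2-9F]|F0A)")

# Assumed definition of "Chinese" for this sketch; other Han blocks are left out.
HAN_RANGES = [(0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0x20000, 0x2A6DF)]

def hex_test(text: str) -> bool:
    """The hex-prefix test from the query, emulated on a UTF-8 hex dump."""
    return HEX_PATTERN.search(text.encode("utf-8").hex().upper()) is not None

def has_han(text: str) -> bool:
    """Stricter test: at least one code point inside the assumed Han ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in HAN_RANGES)

def count_rows_with_han(values) -> int:
    """Count each value at most once, however many Han characters it holds."""
    return sum(1 for value in values if has_han(value))

for sample in ["€", "あ", "北京实业"]:
    print(repr(sample), hex_test(sample), has_han(sample))

print(count_rows_with_han(["北京实业", "abc", "€5", "北京 Beijing"]))
```

The euro sign (E282AC) and the hiragana あ (E38182) both pass the hex test but fail the code-point test, while "北京实业" passes both.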

Source: https://stackoverflow.com/questions/35061775/how-to-detect-chinese-character-in-mysql

Tags

mysql