Question
I need to count how many values in a list of columns contain Chinese characters. For example, if "北京实业" occurs, that is four Chinese characters, but I count it only once because it occurs in the column.
Is there any specific code to figure this out?
Answer 1:
SELECT COUNT(*)
FROM tbl
WHERE HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)'
This counts the number of records with Chinese characters in column col.
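If the goal is one count per column across several columns, the same test can be repeated for each column. A minimal sketch, assuming hypothetical column names col1 and col2 (they are not from the question); MySQL evaluates the REGEXP comparison to 1 or 0, so SUM() adds up the matching rows:

```sql
-- col1 and col2 are placeholder column names; substitute the real ones.
-- Each row contributes at most 1 per column, however many Chinese
-- characters the value contains. NULL values are skipped by SUM().
SELECT
    SUM(HEX(col1) REGEXP '^(..)*(E[2-9F]|F0A)') AS col1_rows_with_chinese,
    SUM(HEX(col2) REGEXP '^(..)*(E[2-9F]|F0A)') AS col2_rows_with_chinese
FROM tbl;
```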
Problems:
- I am not sure what ranges of hex represent Chinese.
- The test may include Korean and Japanese. ("CJK")
- In MySQL, 4-byte Chinese characters need utf8mb4 instead of utf8.
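For the utf8mb4 point, one possible way to check and convert an existing table is sketched here. It assumes the table is named tbl, as in the query, and that converting every character column at once is acceptable; take a backup before trying it on real data.

```sql
-- Show the collation (and thus the character set) of each column.
SHOW FULL COLUMNS FROM tbl;

-- Convert all character columns of tbl to utf8mb4.
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;

-- Make the current connection send and receive utf8mb4 as well.
SET NAMES utf8mb4;
```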
Elaboration
I am assuming the column in the table is CHARACTER SET utf8. In utf8 encoding, Chinese characters begin with a byte between hex E2 and E9, or EF, or F0. Those starting with hex E will be 3 bytes long, but I am not checking the length; the F0 ones will be 4 bytes.
The regexp starts with ^(..)*, meaning "from the start of the string (^), locate 0 or more (*) 2-character (..) values". Because HEX() produces two hex digits per byte, this keeps the match aligned on byte boundaries. After that should be either E-something or F0A. After that, anything can occur. The E-something is, more specifically, E followed by any of 2, 3, 4, 5, 6, 7, 8, 9, or F.
Picked at random, I see that 草 encodes as the 3 hex bytes E88D89, and 𠜎 encodes as the 4 hex bytes F0A09C8E.
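These encodings, and the behavior of the regexp on them, can be checked directly in MySQL. The sketch below assumes the client sends UTF-8 and the connection uses utf8mb4 (needed for the 4-byte character); the 'abc' row is an added non-Chinese control value.

```sql
SET NAMES utf8mb4;

-- Expected hex_bytes: E88D89, F0A09C8E, 616263
-- Expected matches:   1,      1,        0
SELECT s,
       HEX(s) AS hex_bytes,
       HEX(s) REGEXP '^(..)*(E[2-9F]|F0A)' AS matches
FROM (
    SELECT '草' AS s
    UNION ALL SELECT '𠜎'
    UNION ALL SELECT 'abc'
) AS samples;
```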
I do not know of a better way to check a string for a specific language.
As you found, the REGEXP can be rather slow.
This regexp could be overkill, in that some non-Chinese characters may also be matched.
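One way to narrow the match, offered as a sketch and not as a complete definition of "Chinese": restrict the test to the CJK Unified Ideographs block, U+4E00 to U+9FFF, which encodes in UTF-8 as the bytes E4 B8 80 through E9 BF BF. This drops punctuation, kana, and full-width forms, but it still matches Japanese kanji and Korean hanja, which share these code points, and it misses ideographs outside that block, including the 4-byte ones.

```sql
-- Lead byte E4 with second byte B8..BF covers U+4E00..U+4FFF;
-- lead bytes E5..E9 cover U+5000..U+9FFF.
SELECT COUNT(*)
FROM tbl
WHERE HEX(col) REGEXP '^(..)*(E4B[89A-F]|E[5-9])';
```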
Source: https://stackoverflow.com/questions/35061775/how-to-detect-chinese-character-in-mysql