How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

I have a MySQL table with two columns, both using the utf8_unicode_ci collation. It contains the following rows. Besides ASCII, the second field also contains Unicode code points like U+

3 Answers

不要未来只要你来

2021-02-19 17:56
I'm not dead-set on using MySQL

Postgres seems to handle it quite fine:
```
test=# select 'ˌˈʔ' like '___';
 ?column? 
----------
 t
(1 row)

test=# select 'ˌˈʔ' ~ '^.{3}$';
 ?column? 
----------
 t
(1 row)
```
If you go down that road, note that Postgres' ilike operator is the one that behaves like MySQL's like. (In Postgres, like is case-sensitive.)

For a MySQL-specific solution, you might be able to work around it by binding a user-defined function (maybe one backed by the ICU library?) into MySQL.
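
To make the difference concrete, here is a small Python sketch (not MySQL or Postgres) of the two behaviours. It assumes the failing engine treats `.` as one byte while the working one treats it as one code point, which is what the symptoms in the question suggest; the variable names are made up for illustration.

```python
import re

text = "ˌˈʔ"                  # three code points, as in the Postgres session
raw = text.encode("utf-8")     # the same string as UTF-8 bytes

# Character-aware matching: "." consumes one code point.
char_match = re.fullmatch(r".{3}", text) is not None

# Byte-oriented matching: "." consumes one byte, so three dots are not enough.
byte_match = re.fullmatch(rb".{3}", raw) is not None

# Each of these characters takes two bytes in UTF-8, so six dots match instead.
byte_len_match = re.fullmatch(rb".{6}", raw) is not None

print(len(text), len(raw), char_match, byte_match, byte_len_match)
```

A byte-oriented engine is not broken as such; it is just counting a different unit, which is why `_` and `.` seem to "miscount" multi-byte characters.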

北荒

2021-02-19 17:58
EDITED to incorporate a fix for valid criticism

Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:
```
select * from mytable
where hex(ipa) rlike concat('^(..)*', hex('needle')); -- looking for 'needle' in the haystack; the ^ anchor keeps the match aligned to hex pairs
```
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.

This works for "normal" columns too, you just don't need it.

p.s. @Kieren's (valid) point addressed using rlike to enforce char pairs
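
Why the hex-pair alignment matters can be checked outside MySQL. The following Python sketch mimics the idea under the assumption that both strings are UTF-8 encoded; the function names are hypothetical, and `re.search` stands in for `RLIKE`. Note that the pattern has to be anchored at the start, otherwise `(..)*` can match zero pairs anywhere and no alignment is enforced.

```python
import re

def to_hex(text: str) -> str:
    """Render the UTF-8 bytes of text as uppercase hex, similar to HEX()."""
    return text.encode("utf-8").hex().upper()

def naive_hex_contains(haystack: str, needle: str) -> bool:
    """Plain substring search on the hex strings, with no alignment."""
    return to_hex(needle) in to_hex(haystack)

def aligned_hex_contains(haystack: str, needle: str) -> bool:
    """Only accept matches that start on a hex-pair (byte) boundary."""
    pattern = "^(..)*" + re.escape(to_hex(needle))
    return re.search(pattern, to_hex(haystack)) is not None

# "Ĥ" is C4 A4 in UTF-8, so its hex "C4A4" contains "4A" (the hex of "J")
# straddling the two bytes.
print(naive_hex_contains("Ĥ", "J"))        # misaligned false positive
print(aligned_hex_contains("Ĥ", "J"))      # rejected
print(aligned_hex_contains("ˌˈʔ", "ˈ"))    # genuine match on a byte boundary
```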

我在风中等你

2021-02-19 18:09

You have problems with UTF8? Eliminate them.

How many special characters do you use? You're using only lowercase letters, am I right? If so, my tip is: write a function that converts each special character to a regular one, e.g. "æ" -> "A" and so on, and add a column to the table that stores the converted value (you have to convert all existing values first, and then again on each insert/update). When searching, you just convert the search string with the same function and run the regexp against that column.
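
A rough Python sketch of that idea follows. The replacement table is invented for illustration (it assumes uppercase ASCII letters never occur in the real data), and the byte-level `re.search` only stands in for a regexp engine that is not multi-byte aware.

```python
import re

# Hypothetical mapping: each special character gets one unused ASCII letter.
REPLACEMENTS = str.maketrans({"æ": "A", "ʔ": "Q", "ˈ": "P", "ˌ": "S"})

def to_plain(text: str) -> str:
    """Convert special characters to their single-byte stand-ins."""
    return text.translate(REPLACEMENTS)

def byte_regex_search(pattern: str, stored_plain: str) -> bool:
    """Convert the pattern the same way, then match byte by byte.

    encode("ascii") raises if an unmapped special character slipped through.
    """
    converted = to_plain(pattern).encode("ascii")
    return re.search(converted, stored_plain.encode("ascii")) is not None

stored = to_plain("ˌˈʔ")                    # what the extra column would hold
print(stored)
print(byte_regex_search("^.{3}$", stored))  # three characters, three bytes
print(byte_regex_search("ˈʔ$", stored))     # special characters in the pattern
```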

If there are too many kinds of special characters for single-character replacements, convert each one to a multi-character token instead. 1. To avoid finding "aa" in the sequence "ba ab", put a prefix on every token, like "@ba@ab". 2. To avoid finding "@a" in "@ab", use fixed-length tokens, say 2 characters each.
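
The two rules can be sketched in Python as follows. The token table is made up to reproduce the "ba ab" example, and the sketch assumes every character in the column has an entry in it (an unknown character raises `KeyError`).

```python
# Hypothetical fixed-length (2-character) tokens for three special characters.
TOKENS = {"æ": "ba", "ø": "ab", "å": "aa"}
PREFIX = "@"

def encode_plain(text: str) -> str:
    """Concatenate tokens with no separator (the problematic variant)."""
    return "".join(TOKENS[ch] for ch in text)

def encode_prefixed(text: str) -> str:
    """Put the prefix in front of every fixed-length token."""
    return "".join(PREFIX + TOKENS[ch] for ch in text)

# Without a prefix, "å" ("aa") is found across the boundary of "ba" + "ab".
print(encode_plain("å") in encode_plain("æø"))

# With the prefix, a match can only start at a token boundary.
print(encode_prefixed("æø"))
print(encode_prefixed("å") in encode_prefixed("æø"))
print(encode_prefixed("ø") in encode_prefixed("æø"))
```

Because every token has the same length, a prefixed token can never be a proper prefix of another one, which is the "@a" versus "@ab" case from the second rule.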
