I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Except for ASCII, the second field also contains Unicode codepoints like U+
I'm not dead-set on using MySQL
Postgres seems to handle it quite fine:
test=# select 'ˌˈʔ' like '___';
?column?
----------
t
(1 row)
test=# select 'ˌˈʔ' ~ '^.{3}$';
?column?
----------
t
(1 row)
If you go down that road, note that in Postgres' ilike
operator matches that of MySQL's like
. (In Postgres, like
is case-sensitive.)
For the MySQL-specific solution, you mind be able to work around by binding some user-defined function (maybe bind the ICU library?) into MySQL.
EDITED to incorporate fix to valid critisism
Use the HEX()
function to render your bytes to hexadecimal and then use RLIKE
on that, for example:
select * from mytable
where hex(ipa) rlike concat('(..)*', hex('needle'), '(..)*'); -- looking for 'needle' in haystack, but maintaining hex-pair alignment.
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.
This works for "normal" columns too, you just don't need it.
p.s. @Kieren's (valid) point addressed using rlike
to enforce char pairs
You have problems with UTF8? Eliminate them.
How many special characters do you use? Are you using only locase letters, am I right? So, my tip is: Write a function, which converts spec chars to regular chars, e.g. "æ" ->"A" and so on, and add a column to the table which stores that converted value (you have to convert all values first, and upon each insert/update). When searching, you just have to convert search string with the same function, and use it on that field with regexp.
If there're too many kind of special chars, you should convert it to multi-char. 1. Avoid finding "aa" in the "ba ab" sequence use some prefix, like "@ba@ab". 2. Avoid finding "@a" in "@ab" use fixed length tokens, say, 2.