How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

灰色年华 2021-02-19 17:30

I have a MySQL table with two columns, both using the utf8_unicode_ci collation. It contains the following rows. Besides ASCII, the second column also contains Unicode code points like U+

3 answers
  • I'm not dead-set on using MySQL

    Postgres seems to handle it quite fine:

    test=# select 'ˌˈʔ' like '___';
     ?column? 
    ----------
     t
    (1 row)
    
    test=# select 'ˌˈʔ' ~ '^.{3}$';
     ?column? 
    ----------
     t
    (1 row)
    

    If you go down that road, note that Postgres' ilike operator behaves like MySQL's like. (In Postgres, like is case-sensitive.)


    For a MySQL-specific solution, you might be able to work around the problem by binding a user-defined function (maybe one that wraps the ICU library?) into MySQL.
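
    The difference between the two behaviours can be reproduced outside the database. The following is a minimal Python sketch, not part of the original answer; it assumes the string is stored as UTF-8, where each of these three code points takes two bytes. A character-aware engine sees three characters, while a byte-wise engine sees six bytes.

```python
import re

# The string from the Postgres examples, written with escapes:
# U+02CC, U+02C8, U+0294
s = "\u02cc\u02c8\u0294"
raw = s.encode("utf-8")        # six bytes, two per code point

# Character-aware matching: '.' consumes one code point.
char_match = re.fullmatch(r".{3}", s) is not None

# Byte-wise matching: '.' consumes one byte, so three dots are not enough.
byte_match_3 = re.fullmatch(rb".{3}", raw) is not None
byte_match_6 = re.fullmatch(rb".{6}", raw) is not None

print(len(s), len(raw), char_match, byte_match_3, byte_match_6)
```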

  • 2021-02-19 17:58

    EDITED to incorporate a fix in response to valid criticism

    Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:

    -- Look for 'needle' in the haystack while keeping hex-pair alignment.
    -- The leading ^ matters: without it, (..)* can match nothing and the needle may match at an odd offset.
    select * from mytable
    where hex(ipa) rlike concat('^(..)*', hex('needle'));
    

    The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.

    This works for "normal" columns too, you just don't need it.

    p.s. @Kieren's (valid) point addressed using rlike to enforce char pairs
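
    To see why the pair alignment matters, here is a small Python sketch that mimics the HEX()/RLIKE idea with `bytes.hex()` and `re`. It is an illustration only, not MySQL code, and the helper names `to_hex`, `hex_contains_naive` and `hex_contains_aligned` are made up. Without alignment, the hex digits of a needle can match across the boundary between two bytes.

```python
import re

def to_hex(text: str) -> str:
    """Uppercase hex of the UTF-8 bytes, similar in spirit to MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()

def hex_contains_naive(haystack: str, needle: str) -> bool:
    # Plain substring search on hex digits: may match across byte boundaries.
    return to_hex(needle) in to_hex(haystack)

def hex_contains_aligned(haystack: str, needle: str) -> bool:
    # '^(?:..)*' forces the needle to start at an even offset, i.e. on a byte boundary.
    pattern = "^(?:..)*" + re.escape(to_hex(needle))
    return re.search(pattern, to_hex(haystack)) is not None

# "CS" is 43 53 in hex; the needle "5" is 35, which appears only across the boundary.
print(hex_contains_naive("CS", "5"))                       # True (false positive)
print(hex_contains_aligned("CS", "5"))                     # False
print(hex_contains_aligned("\u02c8\u0294", "\u0294"))      # True
```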

  • 2021-02-19 18:09

    You have problems with UTF8? Eliminate them.

    How many special characters do you use? You are only using lowercase letters, am I right? So my tip is: write a function that converts special characters to regular characters, e.g. "æ" -> "A" and so on, and add a column to the table that stores the converted value (you have to convert all existing values first, and then again upon each insert/update). When searching, you just convert the search string with the same function and use it on that column with regexp.

    If there are too many kinds of special characters, convert each one to a multi-character token instead. 1. To avoid finding "aa" in the sequence "ba ab", use a prefix, like "@ba@ab". 2. To avoid finding "@a" in "@ab", use fixed-length tokens, say, of length 2.
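
    The conversion could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the token table, the "@" prefix with two following characters, and the helper names are made up, and plain ASCII characters are left as they are. In a real setup the same function would fill the extra column and convert the search string.

```python
import re

# Hypothetical token table: "@" prefix plus exactly two ASCII characters.
TOKENS = {
    "\u00e6": "@ae",  # æ
    "\u0294": "@gs",  # ʔ
    "\u02c8": "@ps",  # ˈ
    "\u02cc": "@ss",  # ˌ
    "@": "@at",       # escape the prefix itself so tokens stay unambiguous
}

def to_ascii_key(text: str) -> str:
    """Replace each mapped character with its token; keep other characters."""
    return "".join(TOKENS.get(ch, ch) for ch in text)

# One logical character is either a whole token or a single non-prefix character.
LOGICAL_CHAR = r"(?:@..|[^@])"

def has_n_chars(text: str, n: int) -> bool:
    # Rough analogue of LIKE '___' on the converted value.
    return re.fullmatch(LOGICAL_CHAR + "{%d}" % n, to_ascii_key(text)) is not None

def key_contains(haystack: str, needle: str) -> bool:
    # Step through whole logical characters so a needle never matches inside a token.
    pattern = "^" + LOGICAL_CHAR + "*" + re.escape(to_ascii_key(needle))
    return re.search(pattern, to_ascii_key(haystack)) is not None

word = "\u02c8\u00e6\u0294"            # ˈæʔ
print(to_ascii_key(word))              # @ps@ae@gs
print(has_n_chars(word, 3))            # True
print(key_contains(word, "\u00e6"))    # True
print(key_contains(word, "a"))         # False: "a" occurs only inside the token "@ae"
```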
