MySQL REGEXP query - accent insensitive search

点点圈 提交于 2019-11-29 09:42:15

You're out of luck:

Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

The [[:<:]] and [[:>:]] regexp operators are markers for word boundaries. The closest you can achieve with the LIKE operator is something on this line:

SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
   OR wine_name LIKE 'Faugères %'
   OR wine_name LIKE '% Faugères'

As you can see it's not fully equivalent because I've restricted the concept of word boundary to spaces. Adding more clauses for other boundaries would be a mess.

You could also use full text searches (although it isn't the same) but you can't define full text indexes in InnoDB tables (yet).

You're certainly out of luck :)


Addendum: this has changed as of MySQL 8.0:

MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe.

Because REGEXP and RLIKE are byte oriented, have you tried:

SELECT 'Faugères' REGEXP 'Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s';

This says one of these has to be in the expression. Notice that I haven't used the plus(+) because that means ONE OR MORE. Since you only want one you should not use the plus.

utf8_general_ci see no difference between accent/no accent when sorting. Maybe this true for searches as well. Also, change REGEXP to LIKE. REGEXP makes binary comparison.

Ok I just stumbled on this question while searching for something else.

This returns true.

SELECT 'Faugères' REGEXP 'Faug[eèêéë]+r[eèêéë]+s';

Hope it helps.

Adding the '+' Tells the regexp to look for one or more occurrences of the characters.

To solve this problem, I tried different things, including using the binary keyword or the latin1 character set but to no avail.
Finally, considering that it is a MySql bug, I ended up replacing the é and è chars,

Like this :

SELECT * 
FROM `table` 
WHERE replace(replace(wine_name, 'é', 'e'), 'è', 'e') REGEXP '[[:<:]]Faugeres[[:>:]]'
BigBud52

I had the same problem trying to find every record matching one of the following patterns: 'copropriété', 'copropriete', 'COPROPRIÉTÉ', 'Copropri?t?'

REGEXP 'copropri.{1,2}t.{1,2} worked for me. Basically, .{1,2} will should work in every case wether the character is 1 or 2 byte encoded.

Explanation: https://dev.mysql.com/doc/refman/5.7/en/regexp.html

Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

I have this problem, and went for Álvaro's suggestion above. But in my case, it misses those instances where the search term is the middle word in the string. I went for the equivalent of:

SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
   OR wine_name LIKE 'Faugères %'
   OR wine_name LIKE '% Faugères'
   OR wine_name LIKE '% Faugères %'
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!