MySQL diacritic insensitive search (Arabic)

Posted by 旧时模样 on 2019-12-23 12:36:45

Question


I am having trouble making a diacritic-insensitive search work with Arabic text.

I have tested multiple setups for the table in question: the utf8 and utf16 character sets, with the collations utf8_general_ci, utf16_general_ci and utf16_unicode_ci.

The search works for special characters such as å and ä. For example:

select * from test where text like '%a%'

returns the rows where text is a, å or ä. But it does not work with the Arabic diacritics. For example, if the text is بِسْمِ and I search for بسم, I don't get any hits.

Any ideas how to get past this?

The real usage will later be in PHP (a search function), but I'm working directly in the MySQL database for testing before I port it over to PHP.

(from Comment)

CREATE TABLE test (
    id int(11) unsigned NOT NULL AUTO_INCREMENT,
    text text COLLATE utf8_unicode_ci,
    PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Answer 1:


Run SHOW COLLATION; to see which collations you have available. On my version, I don't see anything that looks tailored to Arabic. However, utf8_unicode_ci seems to do the folding you want. Here is a simple way to try it:

SELECT 'بِسْمِ' = 'بسم' COLLATE utf8_unicode_ci;

The result I got back was 1 (true), meaning the two strings are considered equal. With utf8_general_ci it came back as 0, meaning not equal.

Then declare your columns as VARCHAR(...) (or TEXT) CHARACTER SET utf8 COLLATE utf8_unicode_ci. The same applies to utf8mb4 (with utf8mb4_unicode_ci).

To build your own collation (and submit it for inclusion in future versions), see http://dev.mysql.com/doc/refman/5.6/en/adding-collation.html




Answer 2:


(This is not an "answer", but a "resolution".)

It seems that LIKE does not work with your Arabic string. I don't know what else it fails on. I recommend you file a bug report at http://bugs.mysql.com . Here is a test case showing that neither LIKE '...' nor LIKE '%...%' finds both strings, whereas '=' does:

CREATE  TABLE so28863402 (
    id int(11) unsigned NOT NULL AUTO_INCREMENT,
    txt text COLLATE utf8_unicode_ci,   -- deliberate choice of COLLATION
    PRIMARY KEY (id)
) ENGINE=InnoDB
        DEFAULT CHARSET=utf8;
INSERT INTO so28863402 (txt) VALUES
    (UNHEX('D8A8D990D8B3D992D985D990')),  -- Using hex to avoid any copy/paste issues
    (UNHEX('D8A8D8B3D985'));  -- The values should compare equal
SELECT id, txt, HEX(txt) FROM so28863402;
SELECT txt, COUNT(*) FROM so28863402 GROUP BY txt; -- GROUP BY finds them equal.
SELECT * from so28863402
    WHERE txt = 'بسم';   -- Finds both rows (correct)
SELECT * from so28863402
    WHERE txt LIKE '%بسم%';  -- Finds one row (incorrect)
-- Further checks:
SELECT * FROM so28863402 WHERE txt  =   UNHEX(  'D8A8D8B3D985'  );
SELECT * FROM so28863402 WHERE txt LIKE UNHEX(  'D8A8D8B3D985'  );
SELECT * FROM so28863402 WHERE txt LIKE UNHEX('25D8A8D8B3D98525'); -- x25 is '%'
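
For readers who want to see what the two hex values in the test case contain, here is a minimal Python sketch (Python is used only for illustration; it is not part of the original answer). It decodes both values, lists each code point with its Unicode category, and shows that the strings differ only in nonspacing marks (category Mn). The helper name strip_marks is hypothetical, and stripping every Mn character after NFD normalization is an assumption that is broader than Arabic harakat: it also folds Latin accents such as å to a, and it may change some Arabic letters that decompose into a base letter plus a mark.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Return text with all nonspacing marks (Unicode category Mn) removed."""
    # NFD splits precomposed characters (e.g. å) into base letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The same byte sequences that the test case inserts with UNHEX(...).
with_marks = bytes.fromhex("D8A8D990D8B3D992D985D990").decode("utf-8")
without_marks = bytes.fromhex("D8A8D8B3D985").decode("utf-8")

# List every code point of the vocalized string with its category and name.
for ch in with_marks:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# Once the marks are removed, the two strings are identical.
print(strip_marks(with_marks) == without_marks)  # True
```

If LIKE keeps failing, one possible workaround (an assumption, not something the answers above propose) is to apply such a function in the application, store the stripped text in a separate column, and run LIKE against that column with an equally stripped search term.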



Answer 3:


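-- table_name and column_name are placeholders for your own table and column;
-- on InnoDB, MATCH ... AGAINST needs a FULLTEXT index on that column.
SELECT * FROM table_name
WHERE MATCH (column_name)
AGAINST ('بسم' IN BOOLEAN MODE);

This query ignores diacritics. Try it.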



Source: https://stackoverflow.com/questions/28863402/mysql-diacritic-insensitive-search-arabic
