I'm working on a UTF-8 Persian website with an integrated MySQL database. All the content on the website is entered through an admin panel, and all of it is in Persian.
I struggled with a similar situation 5-6 years ago, when Lucene was not an option for MySQL and there was no Sphinx (I have never tried Sphinx on this problem). What I did was collect most of the possible alternate characters and put them in an array in PHP. If the input keyword contained any of those characters, I generated all the possible alternate spellings of it.
So for the input 'بازی' I would generate {'بازي' , 'بازی' } and then query MySQL for both, as in the simple query below:
SELECT title, Description FROM Games WHERE Description LIKE '%بازي%' OR Description LIKE '%بازی%'
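The approach above can be sketched roughly as follows. This is not the original PHP code, only a minimal Python illustration of the same idea. The `ALTERNATES` table, the function names, and the kaf/keheh pair are assumptions added for the example; the `Games` table and `Description` column come from the query above. The `%s` placeholder style is the one used by common Python MySQL drivers, and the sketch does not escape `%` or `_` inside the keyword.

```python
from itertools import product

# Hypothetical alternates table: each tuple lists characters that are
# treated as interchangeable when searching.
ALTERNATES = [
    ("\u064A", "\u06CC"),  # Arabic yeh (ي) and Farsi yeh (ی)
    ("\u0643", "\u06A9"),  # Arabic kaf (ك) and Persian keheh (ک)
]


def build_lookup(groups):
    """Map every character to the group of alternates it belongs to."""
    lookup = {}
    for group in groups:
        for ch in group:
            lookup[ch] = group
    return lookup


def spelling_variants(keyword, groups=ALTERNATES):
    """Return every spelling obtained by swapping interchangeable characters."""
    lookup = build_lookup(groups)
    # Characters without alternates contribute a single choice: themselves.
    choices = [lookup.get(ch, (ch,)) for ch in keyword]
    return ["".join(combo) for combo in product(*choices)]


def build_like_query(keyword):
    """Build a parameterized query with one LIKE clause per variant."""
    variants = spelling_variants(keyword)
    where = " OR ".join(["Description LIKE %s"] * len(variants))
    sql = "SELECT title, Description FROM Games WHERE " + where
    params = ["%" + v + "%" for v in variants]
    return sql, params
```

The number of variants doubles with every character that has an alternate, so this only stays practical for short keywords and a short alternates list. Passing the variants as parameters rather than concatenating them into the SQL string also avoids injection through the search box.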
The primary list of alternatives is not very long though.
The first letter (ي) is Yāʾ in the Arabic alphabet. The second letter (ی) is ye in the Perso-Arabic alphabet.
More on the Perso-Arabic alphabet here: http://en.wikipedia.org/wiki/Perso-Arabic_alphabet
"Two dots are removed in the final ye (ی). Arabic differentiates the final yāʾ with the two dots and the alif maqsura (except in Egyptian Arabic), which is written like a final yāʾ without two dots.
Because Persian drops the two dots in the final ye, the alif maqsura cannot be differentiated from the normal final ye. For example, the name Musâ (Moses) is written موسی. In the final letter in Musâ, Persian does not differentiate between ye or an alif maqsura."
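The letters discussed here look alike but are separate Unicode code points, which is why a plain LIKE comparison treats them as different characters. A small Python sketch (the `describe` helper is a name made up for this example) prints the code point and official Unicode name of each one:

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Arabic yeh, Farsi yeh, and alif maqsura, written as escapes so the
# visually similar glyphs cannot be confused in the source code.
for ch in ("\u064A", "\u06CC", "\u0649"):
    print(describe(ch))
```

Running the same helper over a stored keyword is a quick way to check which form the admin panel or the visitor's keyboard actually produced.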
Seems to be an interesting problem...