How to remove Arabic diacritics dynamically
I'm designing a "chm" ebook with multiple HTML pages that contain Arabic text, but sometimes the search engine needs to highlight so
I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.
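var normalize_text = function(text) {
    // remove special characters (keeps Arabic letters, Arabic-Indic digits, Latin letters, spaces and ASCII digits)
    text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
    // normalize Arabic
    text = text.replace(/(آ|إ|أ)/g, 'ا');
    text = text.replace(/(ة)/g, 'ه');
    text = text.replace(/(ئ|ؤ)/g, 'ء');
    text = text.replace(/(ى)/g, 'ي');
    // convert Arabic-Indic digits to their ASCII counterparts
    var starter = 0x660;
    for (var i = 0; i < 10; i++) {
        // replace() returns a new string, so the result must be assigned;
        // the 'g' flag replaces every occurrence, not just the first
        text = text.replace(new RegExp(String.fromCharCode(starter + i), 'g'), String.fromCharCode(48 + i));
    }
    return text;
};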
<input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
<button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>
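The digit conversion step described above can also be done in one pass with a replace callback. This is only a sketch of that idea; the function name `arabicDigitsToAscii` is made up for illustration and is not part of the answer's code.

```javascript
// Hypothetical helper: maps Arabic-Indic digits (U+0660-U+0669) to ASCII 0-9.
function arabicDigitsToAscii(text) {
  return text.replace(/[\u0660-\u0669]/g, function (d) {
    // Offset of the digit inside its block, shifted to ASCII '0' (code 48).
    return String.fromCharCode(d.charCodeAt(0) - 0x0660 + 48);
  });
}

// "\u0661\u0662\u0663" is the Arabic-Indic spelling of 123.
console.log(arabicDigitsToAscii("\u0661\u0662\u0663 abc \u0660")); // 123 abc 0
```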
Here's some JavaScript code that removes Arabic diacritics in nearly all cases.
var arabicNormChar = {
'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
}
var simplifyArabic = function (str) {
    return str.replace(/[^\u0000-\u007E]/g, function (a) {
        var retval = arabicNormChar[a];
        if (retval == undefined) { retval = a; }
        return retval;
    }).normalize('NFKD').toLowerCase();
};
//now you can use simplifyArabic(str) on Arabic strings to remove the diacritics
Note: you may override the arabicNormChar to your own preferences.
Use this regex to catch all tashkeel (written with \u escapes so the combining marks survive copying):
[\u0610-\u061A\u064B-\u065F]
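A minimal sketch of how a character class like this might be applied. It assumes the class is meant to cover U+0610 to U+061A and U+064B to U+065F; the names `TASHKEEL` and `removeTashkeel` are mine, not the answer's.

```javascript
// Assumed ranges: U+0610-U+061A (Arabic signs) and U+064B-U+065F (tashkeel).
const TASHKEEL = /[\u0610-\u061A\u064B-\u065F]/g;

// Hypothetical helper: deletes every character matched by the class.
function removeTashkeel(text) {
  return text.replace(TASHKEEL, '');
}

// Input is ba + kasra, seen + sukun, meem + kasra; output is the three bare letters.
const bare = removeTashkeel("\u0628\u0650\u0633\u0652\u0645\u0650");
console.log(bare === "\u0628\u0633\u0645"); // true
```

Note that these ranges do not include the superscript alif (U+0670) or the tatweel (U+0640), so add them if you want those removed too.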
This site has some routines for Javascript Unicode normalization which could be used to do what you're attempting. If nothing else it could provide a good starting point.
If you can preprocess the data, Python has good Unicode routines that make easy work of these sorts of transformations. This might be a good option if you can preprocess your CHM file to produce a separate index file, which could then be merged into your CHM:
import unicodedata

def _strip(text):
    # Decompose to NFD, then drop every nonspacing mark (category 'Mn')
    return ''.join([c for c in unicodedata.normalize('NFD', text)
                    if unicodedata.category(c) != 'Mn'])

composed = u'\xcd\xf1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014d' \
           u'\u0146\u0105\u013c\u012d\u017e\u0119'

print(_strip(composed))  # Internationalize
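If preprocessing is not an option, roughly the same idea can be sketched in JavaScript with `String.prototype.normalize` and a Unicode property escape. This assumes an engine that supports `\p{...}` with the `u` flag (modern browsers and Node.js), which an old CHM viewer may not; the name `stripCombiningMarks` is hypothetical.

```javascript
// Hypothetical helper: decompose to NFD, then drop every nonspacing mark (Mn).
function stripCombiningMarks(text) {
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// Same sample string as the Python version.
const composed = '\xcd\xf1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014d' +
                 '\u0146\u0105\u013c\u012d\u017e\u0119';
console.log(stripCombiningMarks(composed)); // Internationalize
```

Be aware that NFD also splits precomposed Arabic letters, so for example alif with madda (U+0622) comes out as a bare alif (U+0627). That may or may not be what you want for search.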
Try this
Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين
http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/
The code is C#, not JavaScript, though. I'm still trying to figure out how to achieve this in JavaScript.
EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" and they can be removed quite easily.
var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;
function isCharTashkeel(letter)
{
    if (typeof(letter) == "undefined" || letter == null)
        return false;
    var code = letter.charCodeAt(0);
    //1648 - superscript alif
    //1619 - madd: ~
    //1611..1631 is U+064B..U+065F; the range starts at 1611 so that fathatan (U+064B) is included
    return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631)); //tashkeel
}
function stripTashkeel(input)
{
    var output = "";
    //todo consider using a stringbuilder to improve performance
    for (var i = 0; i < input.length; i++)
    {
        var letter = input.charAt(i);
        if (!isCharTashkeel(letter)) //tashkeel
            output += letter;
    }
    return output;
}
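Since the diacritics are separate characters, the highlighting case from the question can also be approached from the other side: leave the page text untouched and build a regex from the search term that allows optional marks after each letter. This is only a sketch under my own assumptions (the set of marks to ignore, the `<mark>` wrapper, and all function names are hypothetical), and it does not HTML-escape the text.

```javascript
// Marks to ignore (assumed): U+0610-U+061A, U+064B-U+065F,
// superscript alif (U+0670) and tatweel (U+0640).
const MARKS = '\\u0610-\\u061A\\u064B-\\u065F\\u0670\\u0640';
const MARKS_RE = new RegExp('[' + MARKS + ']', 'g');

// Escape characters that are special inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a regex matching the query letters with any marks in between.
function buildLooseRegex(query) {
  const bare = query.replace(MARKS_RE, '');
  if (bare === '') return null; // nothing to search for
  const pattern = Array.from(bare)
    .map(function (ch) { return escapeRegExp(ch) + '[' + MARKS + ']*'; })
    .join('');
  return new RegExp(pattern, 'g');
}

// Wrap every loose match in <mark>, keeping the original diacritics.
function highlightLoose(text, query) {
  const re = buildLooseRegex(query);
  if (!re) return text;
  return text.replace(re, function (m) { return '<mark>' + m + '</mark>'; });
}
```

For example, `highlightLoose(pageText, "\u0628\u0633\u0645")` would wrap a fully vowelled occurrence of that word in `<mark>` tags without altering the rest of the text.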
Edit: Here is another way to do it using BuckData http://qurandev.github.com/
Advantages:
- Buck uses less bandwidth.
- In JavaScript, you can search through the entire Buck Quran text in one shot. This is intuitive compared to Arabic search.
- Converting Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
- You can strip out all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (Fathah, Dammah, Kasrah), which leads to more hits.
- Regex plus Buck text can lead to awesome optimizations. All the searches can be run locally. http://qurandev.appspot.com
How is the data generated? It is just a one-to-one mapping using: http://corpus.quran.com/java/buckwalter.jsp
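To give an idea of what the one-to-one mapping looks like, here is a rough sketch based on the standard Buckwalter transliteration table. It is not the BuckData library's code; the function names are hypothetical and the table covers only the basic letters and diacritics.

```javascript
// Buckwalter codes listed in code point order, starting at the given code point.
const BUCK_BLOCKS = [
  [0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"], // hamza .. ghain
  [0x0640, "_fqklmnhwYyFNKaui~o"],        // tatweel .. sukun
  [0x0670, "`{"]                          // superscript alif, alif wasla
];

const TO_BUCK = new Map();
BUCK_BLOCKS.forEach(function (block) {
  Array.from(block[1]).forEach(function (code, i) {
    TO_BUCK.set(String.fromCharCode(block[0] + i), code);
  });
});

// Hypothetical helper: Arabic to Buckwalter; unmapped characters pass through.
function arabicToBuckwalter(text) {
  return Array.from(text).map(function (ch) {
    return TO_BUCK.has(ch) ? TO_BUCK.get(ch) : ch;
  }).join('');
}

// Hypothetical helper: drops the codes for tanween, short vowels, shadda,
// sukun and superscript alif. Only safe on pure Buckwalter text, since
// 'a', 'i', 'u' and 'o' would also be removed from ordinary Latin words.
function stripBuckwalterVowels(buck) {
  return buck.replace(/[FNKaui~o`]/g, '');
}

const buck = arabicToBuckwalter("\u0628\u0650\u0633\u0652\u0645\u0650");
console.log(buck);                        // bisomi
console.log(stripBuckwalterVowels(buck)); // bsm
```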