JavaScript: remove Arabic text diacritics dynamically

栀梦 2020-12-29 11:51

How can I remove Arabic diacritics dynamically? I'm designing an ebook ("chm") made up of multiple HTML pages that contain Arabic text, but sometimes the search engine wants to highlight so

5 answers
  • 2020-12-29 12:30

    I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.

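    var normalize_text = function(text) {
    
      // Remove everything except Arabic letters, Arabic-Indic digits,
      // Latin letters, ASCII digits and spaces (this drops the diacritics).
      text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
    
      // Normalize Arabic letter variants.
      text = text.replace(/(آ|إ|أ)/g, 'ا');
      text = text.replace(/(ة)/g, 'ه');
      text = text.replace(/(ئ|ؤ)/g, 'ء');
      text = text.replace(/(ى)/g, 'ي');
    
      // Convert Arabic-Indic digits to their ASCII counterparts.
      // The result must be assigned back, and a global regex is needed
      // so that every occurrence is replaced, not just the first one.
      var starter = 0x660;
      for (var i = 0; i < 10; i++) {
        text = text.replace(new RegExp(String.fromCharCode(starter + i), 'g'), String.fromCharCode(48 + i));
      }
    
      return text;
    };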
    <input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
    <button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>
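
    The digit-conversion step in the function above can also be written as a single pass over the string. This is only a sketch: the helper name arabicDigitsToAscii is hypothetical, and it assumes that only the Arabic-Indic digits U+0660 to U+0669 need converting.

```javascript
// Hypothetical helper (not part of the answer above): map the Arabic-Indic
// digits U+0660..U+0669 to ASCII '0'..'9' in one pass.
function arabicDigitsToAscii(text) {
  return text.replace(/[\u0660-\u0669]/g, function (digit) {
    // The offset from U+0660 is the digit's numeric value; 48 is the code of '0'.
    return String.fromCharCode(digit.charCodeAt(0) - 0x0660 + 48);
  });
}

// U+0661 U+0669 U+0660 are the Arabic-Indic digits 1, 9 and 0.
console.log(arabicDigitsToAscii('\u0661\u0669\u0660 pages')); // 190 pages
```

    A callback replacement avoids building ten separate regular expressions and leaves every other character untouched.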

  • 2020-12-29 12:32

    Here's some JavaScript code that removes Arabic diacritics in nearly all cases.

    var arabicNormChar = {
        'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
    }
    
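    var simplifyArabic = function (str) {
        // Replace each non-ASCII character with its mapped value, if it has one;
        // characters that are not in the map are kept unchanged.
        return str.replace(/[^\u0000-\u007E]/g, function (a) {
            var retval = arabicNormChar[a];
            if (retval === undefined) { retval = a; }
            return retval;
        }).normalize('NFKD').toLowerCase();
    };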
    
    //now you can use simplifyArabic(str) on Arabic strings to remove the diacritics
    

    Note: you can change the entries in arabicNormChar to suit your own preferences.
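
    One way to make such overrides explicit is to build the simplifier from a base map plus per-call overrides. The following is only a sketch: makeSimplifier, baseMap and keepTehMarbuta are hypothetical names, and baseMap is a small illustrative subset written with escapes, not the full map above.

```javascript
// Illustrative subset of a normalization map (an assumption, not the full map).
var baseMap = {
  '\u0623': '\u0627', // alef with hamza above -> alef
  '\u0625': '\u0627', // alef with hamza below -> alef
  '\u0629': '\u0647', // teh marbuta -> heh
  '\u064E': '',       // fatha -> removed
  '\u0650': ''        // kasra -> removed
};

// Hypothetical factory: merge caller overrides over the base map and
// return a function that applies the merged map to non-ASCII characters.
function makeSimplifier(overrides) {
  var map = Object.assign({}, baseMap, overrides);
  return function (str) {
    return str.replace(/[^\u0000-\u007E]/g, function (ch) {
      return Object.prototype.hasOwnProperty.call(map, ch) ? map[ch] : ch;
    });
  };
}

// Example override: keep teh marbuta instead of folding it to heh.
var keepTehMarbuta = makeSimplifier({ '\u0629': '\u0629' });
```

    With this shape the default behaviour stays in one place, and a page that needs a different folding rule only passes the entries it wants to change.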

  • 2020-12-29 12:35

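    Use this regex to match all tashkeel. It is written here with Unicode escapes (the ranges U+0610 to U+061A and U+064B to U+065F), because literal combining marks are easily reordered or lost when the pattern is copied:

    /[\u0610-\u061A\u064B-\u065F]/g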
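
    Assuming the character class above is meant to cover U+0610 to U+061A and U+064B to U+065F, a minimal sketch of applying it could look like this. The names TASHKEEL_RE and removeTashkeel are hypothetical.

```javascript
// Assumed ranges: U+0610..U+061A (Quranic annotation signs) and
// U+064B..U+065F (harakat, tanween, shadda, sukun and related marks).
var TASHKEEL_RE = /[\u0610-\u061A\u064B-\u065F]/g;

// Hypothetical helper: delete every character matched by the class.
function removeTashkeel(text) {
  return text.replace(TASHKEEL_RE, '');
}

// "rabbi": reh, fatha, beh, shadda, kasra -> reh, beh
console.log(removeTashkeel('\u0631\u064E\u0628\u0651\u0650') === '\u0631\u0628'); // true
```

    Note that this class does not include the superscript alef (U+0670) or the tatweel (U+0640); add them to the class if they should be stripped too.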

  • 2020-12-29 12:48

    This site has some routines for JavaScript Unicode normalization which could be used to do what you're attempting. If nothing else it could provide a good starting point.

    If you can preprocess the data, Python has good Unicode routines that make easy work of these sorts of transformations. This might be a good option if you can preprocess your CHM file to produce a separate index file, which could then be merged into your CHM:

    import unicodedata
    
    def _strip(text):
        # Decompose to NFD, then drop every nonspacing mark (category 'Mn').
        return ''.join(c for c in unicodedata.normalize('NFD', text)
                       if unicodedata.category(c) != 'Mn')
    
    composed = ('\xcd\xf1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014d'
                '\u0146\u0105\u013c\u012d\u017e\u0119')
    
    print(_strip(composed))  # Internationalize
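
    The same decompose-then-filter idea can be sketched in JavaScript, assuming an engine that supports String.prototype.normalize and Unicode property escapes (the \p{Mn} class with the u flag). The helper name stripMarks is hypothetical.

```javascript
// Hypothetical helper: decompose to NFD, then delete every nonspacing
// mark (general category Mn), mirroring the Python routine above.
function stripMarks(text) {
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// alef, lam, sukun, hah, fatha, meem, sukun, dal, damma
var vocalized = '\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F';
console.log(stripMarks(vocalized) === '\u0627\u0644\u062D\u0645\u062F'); // true
```

    One caveat: NFD also splits letters such as alef with madda (U+0622) into a plain alef plus a mark, so this approach folds those letters to their bare form as well. That may or may not be what a search index wants.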
    
  • 2020-12-29 12:51

    Try this

    Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
    converted to : الحمد لله رب العالمين 
    

    http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/

    The code is C#, not JavaScript, though. Still trying to figure out how to achieve this in JavaScript.

    EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" and can be removed quite easily.

    var CHARCODE_SHADDA = 1617;
    var CHARCODE_SUKOON = 1618;
    var CHARCODE_SUPERSCRIPT_ALIF = 1648;
    var CHARCODE_TATWEEL = 1600;
    var CHARCODE_ALIF = 1575;
    
    function isCharTashkeel(letter)
    {
        if (typeof(letter) == "undefined" || letter == null)
            return false;
    
        var code = letter.charCodeAt(0);
        //1600 - tatweel (elongation stroke)
        //1648 - superscript alif
        //1611..1631 - tashkeel marks, starting at fathatan (1611) and including madd (1619)
        return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631)); //tashkeel
    }
    
    function stripTashkeel(input)
    {
        var output = "";
        //todo consider using a stringbuilder to improve performance
        for (var i = 0; i < input.length; i++)
        {
            var letter = input.charAt(i);
            if (!isCharTashkeel(letter)) //keep everything that is not tashkeel
                output += letter;
        }
        return output;
    }
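
    Stripping tashkeel helps with matching, but for highlighting inside the original (vocalized) page text it can be easier to leave the text alone and make the search pattern tolerant instead. The following is only a sketch under stated assumptions: the query is already free of tashkeel, the input is plain text rather than HTML markup, and the names TASHKEEL_CLASS, escapeRegExp, tolerantRegExp and highlight are all hypothetical.

```javascript
// Assumed set of marks to ignore: tatweel, U+064B..U+065F, superscript alef.
var TASHKEEL_CLASS = '[\\u0640\\u064B-\\u065F\\u0670]';

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern that allows any run of tashkeel after each query character.
function tolerantRegExp(query) {
  var parts = Array.from(query).map(function (ch) {
    return escapeRegExp(ch) + TASHKEEL_CLASS + '*';
  });
  return new RegExp(parts.join(''), 'g');
}

// Wrap every match in <mark> so the vocalized original stays intact.
function highlight(text, query) {
  if (!query) return text;
  return text.replace(tolerantRegExp(query), function (match) {
    return '<mark>' + match + '</mark>';
  });
}
```

    For example, highlighting the unvocalized query الحمد in vocalized text wraps the whole vocalized word, diacritics included, so the displayed page keeps its tashkeel while the search ignores it.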
    

    Edit: Here is another way to do it, using BuckData: http://qurandev.github.com/

    Advantages:
    - Buck uses less bandwidth.
    - In JavaScript, you can search through the entire Buck Quran text in one shot.
    - It is intuitive compared to Arabic search.
    - Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
    - You can strip out all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (Fathah, Dammah, Kasrah), which leads to more hits.
    - Regex plus Buck text can lead to awesome optimizations. All the searches can be run locally. http://qurandev.appspot.com

    How was the data generated? Just a one-to-one mapping using: http://corpus.quran.com/java/buckwalter.jsp
