Efficiently replace all accented characters in a string?

别跟我提以往 2020-11-22 04:35

For a poor man's implementation of near-collation-correct sorting on the client side, I need a JavaScript function that does efficient single character rep

21 answers
  •  -上瘾入骨i
    2020-11-22 05:29

    If you're looking specifically for a way to convert accented characters to non-accented characters, rather than a way to sort accented characters, then with a little finagling the String.prototype.localeCompare function can be used to find the basic Latin characters that match the extended ones. For example, you might want to produce a human-friendly URL slug from a page title. If so, you can do something like this:

    var baseChars = [];
    for (var i = 97; i < 97 + 26; i++) {
      baseChars.push(String.fromCharCode(i));
    }
    
    //if needed, handle fancy compound characters
    baseChars = baseChars.concat('ss,aa,ae,ao,au,av,ay,dz,hv,lj,nj,oi,ou,oo,tz,vy'.split(','));
    
    function isUpperCase(c) { return c !== c.toLocaleLowerCase() }
    
    function toBaseChar(c, opts) {
      opts = opts || {};
      //if (!('nonAlphaChar' in opts)) opts.nonAlphaChar = '';
      //if (!('noMatchChar' in opts)) opts.noMatchChar = '';
      if (!('locale' in opts)) opts.locale = 'en';
    
      var cOpts = {sensitivity: 'base'};
    
      //exit early for any non-alphabetical character, i.e. one that sorts at or before '9' (digits, punctuation)
      if (c.localeCompare('9', opts.locale, cOpts) <= 0) return opts.nonAlphaChar === undefined ? c : opts.nonAlphaChar;
    
      for (var i = 0; i < baseChars.length; i++) {
        var baseChar = baseChars[i];
    
        var comp = c.localeCompare(baseChar, opts.locale, cOpts);
        if (comp === 0) return isUpperCase(c) ? baseChar.toUpperCase() : baseChar;
      }
    
      return opts.noMatchChar === undefined ? c : opts.noMatchChar;
    }
    
    function latinify(str, opts) {
      //\w already covers digits, so only non-word, non-whitespace characters reach toBaseChar
      return str.replace(/[^\w\s]/g, function(c) {
        return toBaseChar(c, opts);
      })
    }
    
    // Example:
    console.log(latinify('Čeština Tsėhesenėstsestotse Tshivenḓa Emigliàn–Rumagnòl Slovenščina Português Tiếng Việt Straße'))
    
    // "Cestina Tsehesenestsestotse Tshivenda Emiglian–Rumagnol Slovenscina Portugues Tieng Viet Strasse"

    This should perform quite well. If further optimization were needed, a binary search with localeCompare as the comparator could be used to locate the base character. Case is preserved, and the options allow characters that aren't alphabetical, or that have no matching Latin character, to be preserved, replaced, or removed. This implementation is faster and more flexible, and should work with new characters as they are added. The disadvantage is that compound characters like 'ꝡ' have to be handled specifically if they need to be supported.
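
    The binary-search idea mentioned above could look roughly like the sketch below. It is not part of the original answer: the names collator, sortedBaseChars and findBaseChar are made up for illustration, and it assumes the runtime ships Intl.Collator with English collation data. The candidate list has to be sorted with the same collator that is used for the lookup, otherwise the search is not valid.

```javascript
// Illustrative sketch only; helper names are hypothetical.
// One shared collator avoids re-resolving the locale on every comparison.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

// a-z plus two compound entries, sorted with the SAME collator used for lookup.
const sortedBaseChars = 'abcdefghijklmnopqrstuvwxyz'.split('')
  .concat(['ss', 'ae'])
  .sort(collator.compare);

// Returns the lowercase base entry that compares equal to c at base
// sensitivity, or null when no entry matches (digits, punctuation, ...).
function findBaseChar(c) {
  let lo = 0;
  let hi = sortedBaseChars.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cmp = collator.compare(c, sortedBaseChars[mid]);
    if (cmp === 0) return sortedBaseChars[mid];
    if (cmp < 0) hi = mid - 1; // c sorts before the middle entry
    else lo = mid + 1;         // c sorts after the middle entry
  }
  return null;
}

console.log(findBaseChar('é')); // 'e'
console.log(findBaseChar('5')); // null
```

    The result is always lowercase, so a caller would still need to restore case the way toBaseChar does with isUpperCase. With only a few dozen candidates the gain over the linear loop is likely small; reusing a single collator instance may matter more in practice.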
