Is there a way to get rid of accents and convert a whole string to regular letters?

爱一瞬间的悲伤 2020-11-22 04:58

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one?

12 answers
  • 2020-11-22 05:07

    @David Conrad's solution is the fastest one I tried that uses the Normalizer, but it has a bug: it strips characters that are not accents. Chinese characters and letters such as æ, for example, are all removed. The characters we actually want to strip are non-spacing marks, which don't take up extra width in the final string. These zero-width characters end up combined with some other character. If you can see one isolated as a character, for example like this `, my guess is that it has been combined with the space character.

    // requires: import java.text.Normalizer;
    public static String flattenToAscii(String string) {
        // Normalize first: the decomposed form can be longer than the input.
        String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
        char[] out = new char[norm.length()];

        int j = 0;
        for (int i = 0, n = norm.length(); i < n; ++i) {
            char c = norm.charAt(i);
            int type = Character.getType(c);

            // by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
            if (type != Character.NON_SPACING_MARK) {
                out[j] = c;
                j++;
            }
        }
        // Copy only the j characters written, so no trailing '\0' padding is returned.
        return new String(out, 0, j);
    }
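    To make the difference concrete, here is a minimal sketch that runs both filters on the same input. The class name MarkStripDemo and both method names are made up for this illustration; only Normalizer and Character.getType come from the JDK.

```java
import java.text.Normalizer;

public class MarkStripDemo {
    // Keeps every character except non-spacing marks (Unicode category Mn).
    public static String stripMarks(String input) {
        String norm = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        for (int i = 0; i < norm.length(); i++) {
            char c = norm.charAt(i);
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Keeps only ASCII characters after decomposition.
    public static String stripNonAscii(String input) {
        String norm = Normalizer.normalize(input, Normalizer.Form.NFD);
        return norm.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "café æ 中文", written with escapes so file encoding does not matter
        String sample = "caf\u00E9 \u00E6 \u4E2D\u6587";
        // The accent is dropped; æ and the Chinese characters are kept.
        System.out.println(stripMarks(sample));
        // Only "cafe" and the two spaces remain.
        System.out.println(stripNonAscii(sample));
    }
}
```

    Which behavior is right depends on whether the goal is "remove accents" or "force everything to ASCII".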
    
  • 2020-11-22 05:16

    The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

    import java.text.Normalizer;
    
    public class Strip {
        public static String flattenToAscii(String string) {
            StringBuilder sb = new StringBuilder(string.length());
            string = Normalizer.normalize(string, Normalizer.Form.NFD);
            for (char c : string.toCharArray()) {
                if (c <= '\u007F') sb.append(c);
            }
            return sb.toString();
        }
    }
    

    Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

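    public static String flattenToAscii(String string) {
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        // Size the buffer from the normalized string, which can be longer than the input.
        char[] out = new char[string.length()];
        int j = 0;
        for (int i = 0, n = string.length(); i < n; ++i) {
            char c = string.charAt(i);
            if (c <= '\u007F') out[j++] = c;
        }
        // Return only the j characters written, not the unused tail of the buffer.
        return new String(out, 0, j);
    }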
    

    This variation has the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).

  • 2020-11-22 05:18

    Use java.text.Normalizer to handle this for you.

    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    // or Normalizer.Form.NFKD for a more "compatible" decomposition
    

    This separates the accent marks from their base characters. Then you just need to check each character and throw out the ones that aren't plain ASCII.

    string = string.replaceAll("[^\\p{ASCII}]", "");
    

    If your text contains non-ASCII characters that you want to keep, use this instead, which removes only the marks:

    string = string.replaceAll("\\p{M}", "");
    

    In a Unicode-aware regex, \\P{M} (uppercase P) matches the base glyph and \\p{M} (lowercase p) matches each accent mark.

    Thanks to GarretWilson for the pointer and regular-expressions.info for the great unicode guide.
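    As a rough illustration of what the normalization step does before any regex runs, the sketch below looks at the decomposed strings directly. The class name NormalizeDemo and its helper names are invented for this example.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    // NFD followed by removal of every mark character.
    public static String stripMarks(String s) {
        return nfd(s).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // é as a single precomposed character
        String decomposed = nfd(eAcute);
        // One char becomes two: 'e' followed by U+0301 (combining acute accent).
        System.out.println(decomposed.length());
        System.out.printf("U+%04X U+%04X%n", (int) decomposed.charAt(0), (int) decomposed.charAt(1));
        System.out.println(stripMarks(eAcute));

        String ligature = "\uFB01"; // the "fi" ligature
        // NFD leaves the ligature alone; NFKD splits it into 'f' and 'i'.
        System.out.println(nfd(ligature).equals(ligature));
        System.out.println(nfkd(ligature));
    }
}
```

    NFKD is the form to reach for when ligatures and similar compatibility characters should also be flattened. NFD only separates accents from their base letters.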

  • 2020-11-22 05:19

    If you have no library available, one of the best ways is to combine Normalizer with a regex:

    public String flattenToAscii(String s) {
        if (s == null || s.trim().length() == 0)
            return "";
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
    }
    

    This is more efficient than replaceAll("[^\\p{ASCII}]", "") when all you need is to remove diacritics (as in your example).

    If you need to drop every non-ASCII character, you have to use the \\p{ASCII} pattern instead.

    Regards.
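    One consequence of using a fixed range is worth seeing in code. U+0300 to U+036F is the Combining Diacritical Marks block, where most accents on Latin letters end up after NFD, but marks from other scripts live elsewhere. The sketch below compares the range with the broader \\p{M} category. The class and method names are hypothetical.

```java
import java.text.Normalizer;

public class RangeVsCategoryDemo {
    // Removes only marks in the Combining Diacritical Marks block.
    public static String stripLatinMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[\\u0300-\\u036F]", "");
    }

    // Removes every character in the Unicode mark categories.
    public static String stripAllMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // é
        // Both variants reduce é to "e".
        System.out.println(stripLatinMarks(eAcute));
        System.out.println(stripAllMarks(eAcute));

        String ga = "\u304C"; // hiragana GA, which NFD splits into KA + voiced sound mark U+3099
        // The range leaves the Japanese mark in place (2 chars remain).
        System.out.println(stripLatinMarks(ga).length());
        // The category removes it, turning GA into KA (1 char remains).
        System.out.println(stripAllMarks(ga).length());
    }
}
```

    So the narrower range can be the safer choice when the text mixes Latin with scripts whose marks carry meaning. Note that its output stays in decomposed form for those characters.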

  • 2020-11-22 05:19

    In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and .trim(), and then I call this function:

    fun stripAccents(s: String): String {
        val sb = StringBuilder(s)
        for (i in s.indices) {
            // These cover my needs; to convert other accents, just add new entries here.
            val replacement = when (s[i]) {
                'Ã', 'Á' -> 'A'
                'Õ', 'Ó' -> 'O'
                'Ç' -> 'C'
                'Ê', 'É' -> 'E'
                'Ú' -> 'U'
                else -> s[i]
            }
            sb.setCharAt(i, replacement)
        }
        return sb.toString()
    }

    To use this function, call it like this:

    var str: String
    str = editText.text.toString() // get the text from the EditText
    str = str.toUpperCase().trim()

    str = stripAccents(str) // call the function
    
  • 2020-11-22 05:20
    System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
    

    This worked for me. The snippet above prints "aee", which is what I wanted, but

    System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));
    

    didn't do any substitution.
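    For anyone who wants to check the two patterns on their own setup, here is a small sketch that runs both on the same input. The input is written with Unicode escapes so the result does not depend on how the source file is decoded. The class name PatternCheck and its method names are made up for this example.

```java
import java.text.Normalizer;

public class PatternCheck {
    // Decompose, then remove characters from the Combining Diacritical Marks block.
    public static String viaBlock(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Decompose, then remove everything that is not ASCII.
    public static String viaAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        String input = "\u00E0\u00E8\u00E9"; // àèé as precomposed characters
        System.out.println(viaBlock(input));
        System.out.println(viaAscii(input));

        // Print the code points actually present in the input string.
        input.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

    If the two results differ on your machine, printing the code points of the input is a quick way to see what the string really contains.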
