Is there a way to get rid of accents and convert a whole string to regular letters?

爱一瞬间的悲伤 2020-11-22 04:58

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one?

12 Answers
  •  心在旅途
    2020-11-22 05:16

    The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

    import java.text.Normalizer;
    
    public class Strip {
        public static String flattenToAscii(String string) {
            StringBuilder sb = new StringBuilder(string.length());
            string = Normalizer.normalize(string, Normalizer.Form.NFD);
            for (char c : string.toCharArray()) {
                if (c <= '\u007F') sb.append(c);
            }
            return sb.toString();
        }
    }
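
    For comparison, the regex-based approach that the loop above replaces might look roughly like this. It is only a sketch: it assumes the pattern `[^\p{ASCII}]`, the names `RegexStrip` and `flattenWithRegex` are made up for illustration, and the accepted answer's exact expression may differ.

```java
import java.text.Normalizer;

// Hypothetical class name, used only for this illustration.
class RegexStrip {
    // Decompose accented letters into base letter + combining marks (NFD),
    // then delete every character outside the ASCII range with a regex.
    // The pattern is an assumption, not a quote from the accepted answer.
    static String flattenWithRegex(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file encoding-independent.
        System.out.println(flattenWithRegex("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
        System.out.println(flattenWithRegex("s\u00F8ster"));                 // sster
    }
}
```

    Letters without a canonical decomposition, such as ø (U+00F8), are dropped entirely rather than mapped to a base letter, which is why the second call prints `sster`. The loop version behaves the same way, since it applies the same NFD step and the same ASCII filter.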
    

    Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

    public static String flattenToAscii(String string) {
        char[] out = new char[string.length()];
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        int j = 0;
        for (int i = 0, n = string.length(); i < n; ++i) {
            char c = string.charAt(i);
            if (c <= '\u007F') out[j++] = c;
        }
        return new String(out, 0, j); // only the first j chars were filled
    }
    

    This variation combines the correctness of the Normalizer-based approach with some of the speed of the table-based one. On my machine, it is about 4x faster than the accepted answer and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).
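
    To get a rough feel for this kind of comparison on your own machine, a crude timing loop along the following lines could be used. This is a sketch and not the benchmark behind the numbers above: the class `StripTiming`, its method names, the input string, and the iteration count are all assumptions, the regex pattern is a guess at the accepted answer's approach, and a proper harness such as JMH would give more trustworthy results.

```java
import java.text.Normalizer;
import java.util.function.UnaryOperator;

// Hypothetical harness; all names here are illustrative only.
class StripTiming {
    // Loop-based variant: NFD-normalize, then keep only ASCII chars.
    static String withLoop(String s) {
        String d = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(d.length());
        for (int i = 0, n = d.length(); i < n; ++i) {
            char c = d.charAt(i);
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }

    // Regex-based variant (assumed pattern).
    static String withRegex(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    // Applies fn to input `iterations` times and returns elapsed nanoseconds.
    static long time(UnaryOperator<String> fn, String input, int iterations) {
        long sink = 0; // accumulate results so the JIT cannot discard the calls
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += fn.apply(input).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink); // never true; keeps sink live
        return elapsed;
    }

    public static void main(String[] args) {
        String input = "\u00C0 la cr\u00E8me br\u00FBl\u00E9e, s'il vous pla\u00EEt";
        int iterations = 200_000;
        // Warm-up passes so the JIT has compiled both paths before measuring.
        time(StripTiming::withLoop, input, iterations);
        time(StripTiming::withRegex, input, iterations);
        long loopNs = time(StripTiming::withLoop, input, iterations);
        long regexNs = time(StripTiming::withRegex, input, iterations);
        System.out.printf("loop: %d ms, regex: %d ms%n",
                loopNs / 1_000_000, regexNs / 1_000_000);
    }
}
```

    The absolute figures will vary with the JVM, the hardware, and the input text, so only the ratio between the two timings is worth reading.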
