Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll()
method and replacing letters one by one?
The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:
import java.text.Normalizer;
public class Strip {
public static String flattenToAscii(String string) {
StringBuilder sb = new StringBuilder(string.length());
string = Normalizer.normalize(string, Normalizer.Form.NFD);
for (char c : string.toCharArray()) {
if (c <= '\u007F') sb.append(c);
}
return sb.toString();
}
}
Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:
public static String flattenToAscii(String string) {
char[] out = new char[string.length()];
string = Normalizer.normalize(string, Normalizer.Form.NFD);
int j = 0;
for (int i = 0, n = string.length(); i < n; ++i) {
char c = string.charAt(i);
if (c <= '\u007F') out[j++] = c;
}
return new String(out);
}
This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower that @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).