Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll()
method and replacing letters one by one?
EDIT: If you're not stuck with Java <6 and speed is not critical and/or translation table is too limiting, use answer by David. The point is to use Normalizer
(introduced in Java 6) instead of translation table inside the loop.
While this is not "perfect" solution, it works well when you know the range (in our case Latin1,2), worked before Java 6 (not a real issue though) and is much faster than the most suggested version (may or may not be an issue):
/**
* Mirror of the unicode table from 00c0 to 017f without diacritics.
*/
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
"DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
"aaaaaaaceeeeiiii" +
"\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
"AaAaAaCcCcCcCcDd" +
"DdEeEeEeEeEeGgGg" +
"GgGgHhHhIiIiIiIi" +
"IiJjJjKkkLlLlLlL" +
"lLlNnNnNnnNnOoOo" +
"OoOoRrRrRrSsSsSs" +
"SsTtTtTtUuUuUuUu" +
"UuUuWwYyYZzZzZzF";
/**
* Returns string without diacritics - 7 bit approximation.
*
* @param source string to convert
* @return corresponding string without diacritics
*/
public static String removeDiacritic(String source) {
char[] vysl = new char[source.length()];
char one;
for (int i = 0; i < source.length(); i++) {
one = source.charAt(i);
if (one >= '\u00c0' && one <= '\u017f') {
one = tab00c0.charAt((int) one - '\u00c0');
}
vysl[i] = one;
}
return new String(vysl);
}
Tests on my HW with 32bit JDK show that this performs conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms while Normalizer way makes it in 3.7s (37x slower). In case your needs are around performance and you know the input range, this may be for you.
Enjoy :-)