What is the best way to convert a string from Unicode to ASCII without changing its length (that is very important in my case)? Also the characters without any conversion
One issue with Normalizer is that before Java 1.6 it is in the sun.text package, whereas in 1.6 it is in the java.text package and its method signature has changed. So if your application needs to run on both platforms, you'll have to use reflection.
An alternative custom solution is described as technique 3 here
As stated in this answer, the following code should work:
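import java.text.Normalizer;

String s = "口水雞 hello Ä";
// Decompose accented characters into base letter + combining marks
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
// Matches combining diacritical marks, modifier letters and modifier symbols
String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";
// Encode to ASCII (unmappable characters become '?'); getBytes(String) throws UnsupportedEncodingException
String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");
System.out.println(s2);
System.out.println(s.length() == s2.length());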
Output is
??? hello A
true
So you first remove the diacritical marks, then convert to ASCII. Non-ASCII characters will become question marks.
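The whole-string approach above can still change the length in some cases, for example when the input already contains a separate combining mark or a ligature that NFKD expands into two letters. A minimal sketch of a per-character variant that always emits exactly one ASCII character per input char is shown here; the class name AsciiFold and the method foldToAscii are hypothetical names chosen for illustration, and "length" here means String.length() (UTF-16 code units).

```java
import java.text.Normalizer;

public class AsciiFold {
    // Maps each UTF-16 char of the input to exactly one ASCII char,
    // so the output always has the same length() as the input.
    static String foldToAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c); // already ASCII, keep unchanged
                continue;
            }
            // Decompose this single char and drop any combining marks.
            String base = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKD)
                    .replaceAll("\\p{M}+", "");
            // Use the base letter only if it is exactly one ASCII char;
            // otherwise fall back to '?' to keep the position occupied.
            boolean usable = base.length() == 1 && base.charAt(0) < 128;
            out.append(usable ? base.charAt(0) : '?');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "G\u00F6teborg \u53E3"; // o with diaeresis, plus one CJK character
        String folded = foldToAscii(s);
        System.out.println(folded);
        System.out.println(s.length() == folded.length());
    }
}
```

The trade-off is that anything without a single ASCII base letter (CJK characters, ligatures, stand-alone combining marks) becomes a question mark rather than being dropped or expanded.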
Use java.text.Normalizer.normalize() with Normalizer.Form.NFD, then filter out the non-ASCII characters.
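As a rough illustration of that suggestion, assuming Java 1.6 or later: the class name StripNonAscii and the method stripNonAscii are made up for this sketch. Note that filtering removes characters that have no ASCII base letter, so this variant does not keep the original length in general.

```java
import java.text.Normalizer;

public class StripNonAscii {
    static String stripNonAscii(String input) {
        // NFD splits an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove everything outside the ASCII range, including the marks.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // The input contains an o with diaeresis (U+00F6).
        System.out.println(stripNonAscii("G\u00F6teborg"));
    }
}
```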
As Paul Taylor mentioned, there is an issue with using Normalizer if the project needs to compile and run on both pre-1.6 and 1.6+ Java. You will run into trouble because Normalizer lives in different packages (java.text.Normalizer for 1.6 instead of sun.text.Normalizer for pre-1.6) and has a different method signature.
Usually it is recommended to use reflection to invoke the appropriate Normalizer.normalize() method. (An example can be found here.)
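A minimal sketch of what such a reflective lookup might look like follows. The class name NormalizerCompat is hypothetical. The java.text.Normalizer branch uses the documented 1.6+ API; the fallback branch is an assumption: it presumes that older Sun JVMs expose a static sun.text.Normalizer.decompose(String, boolean, int) method, which I have not verified here and which should be checked against the JVM you actually target.

```java
import java.lang.reflect.Method;

public class NormalizerCompat {
    // Returns the canonically decomposed form of the text, using whichever
    // Normalizer class can be found at runtime.
    static String decompose(String text) {
        try {
            // Java 1.6+: java.text.Normalizer.normalize(CharSequence, Normalizer.Form)
            Class<?> formClass = Class.forName("java.text.Normalizer$Form");
            Object nfd = formClass.getField("NFD").get(null);
            Method normalize = Class.forName("java.text.Normalizer")
                    .getMethod("normalize", CharSequence.class, formClass);
            return (String) normalize.invoke(null, text, nfd);
        } catch (Exception java6Unavailable) {
            // Not on 1.6+; fall through to the older class.
        }
        try {
            // ASSUMPTION (unverified): pre-1.6 Sun JVMs provide
            // sun.text.Normalizer.decompose(String, boolean, int).
            Method decompose = Class.forName("sun.text.Normalizer")
                    .getMethod("decompose", String.class, Boolean.TYPE, Integer.TYPE);
            return (String) decompose.invoke(null, text, Boolean.FALSE, Integer.valueOf(0));
        } catch (Exception sunUnavailable) {
            throw new UnsupportedOperationException(
                    "No Normalizer implementation found", sunUnavailable);
        }
    }

    public static void main(String[] args) {
        // U+00C4 decomposes into 'A' followed by U+0308, so this prints 2.
        System.out.println(decompose("\u00C4").length());
    }
}
```

In real code the Method objects would normally be looked up once and cached in static fields rather than resolved on every call.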
But if you don't want to put a reflection mess in your code, you can use the icu4j library. It contains the com.ibm.icu.text.Normalizer class with a normalize() method that performs the same job as java.text.Normalizer/sun.text.Normalizer. The ICU library has (or should have) its own implementation of Normalizer, so you can ship your project with the library and it should be independent of the Java version.
The disadvantage is that the ICU library is quite big.
If you are using the Normalizer class just to remove accents/diacritics from Strings, there is also another way. You can use the Apache Commons Lang library (version 3), which contains StringUtils with the method stripAccents():
String noAccentsString = org.apache.commons.lang3.StringUtils.stripAccents(s);
The Lang3 library probably uses reflection to invoke the appropriate Normalizer according to the Java version. So the advantage is that you don't have a reflection mess in your code.
Caveat: I don't know Java. Just a bit about character sets.
You are not stating which character set you are using exactly.
But no matter which you use, it's impossible in general to convert a Unicode string to ASCII and retain the original byte length and byte positions, simply because a Unicode encoding will use multiple bytes for some characters (obviously).
The only exception I know of would be a UTF-8 string that contains only ASCII characters: This string will already be identical in both UTF-8 and ASCII, because UTF-8 uses multibyte characters only when necessary. (I don't know about the other Unicode flavours, there may be other dynamic ones).
The only workaround I can see is adding a space to any special character that was replaced by an ASCII one, but that will screw up the string (Göteborg in UTF8 would have to become Go teborg to keep the length).
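To make the byte-length point concrete in Java terms, here is a small sketch (the class name EncodingLengths and its helper methods are invented for this illustration; StandardCharsets requires Java 7 or later). It compares encoded byte counts, which is a different measure from String.length(), since the latter counts UTF-16 code units rather than bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingLengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the ASCII and UTF-8 encodings are byte-for-byte identical,
    // which is the case only for strings containing ASCII characters alone.
    static boolean sameInAsciiAndUtf8(String s) {
        return Arrays.equals(
                s.getBytes(StandardCharsets.US_ASCII),
                s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String plain = "Goteborg";
        String accented = "G\u00F6teborg"; // o with diaeresis (U+00F6)
        System.out.println(utf8Length(plain));            // 8
        System.out.println(utf8Length(accented));         // 9
        System.out.println(accented.length());            // 8
        System.out.println(sameInAsciiAndUtf8(plain));    // true
        System.out.println(sameInAsciiAndUtf8(accented)); // false
    }
}
```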
Maybe you want to elaborate on what you want to / need to achieve, so people here can suggest workarounds.