Hello, we have a requirement to parse data sent from users all over the world. Our Linux systems have a de
When using a DirectoryStream, don't forget to close the stream (try-with-resources can help here).
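As a minimal sketch of that pattern (listing the current directory "." is just a placeholder path):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDir {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the DirectoryStream automatically,
        // even if the loop body throws.
        try (DirectoryStream<Path> paths = Files.newDirectoryStream(Paths.get("."))) {
            for (Path path : paths) {
                System.out.println(path.getFileName());
            }
        }
    }
}
```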
The Java system property file.encoding
should match the console's character encoding. The property must be set when starting java
on the command-line:
java -Dfile.encoding=UTF-8 …
Normally this happens automatically, because the console encoding is usually the platform default encoding, and Java will use the platform default encoding if you don't specify one explicitly.
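To see what your JVM actually picked up, you can print the property and the resolved default charset (the class name EncodingCheck is just for illustration):

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // The value passed via -Dfile.encoding (or the platform default)
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        // The charset the JVM actually resolved and uses by default
        System.out.println("default charset = " + Charset.defaultCharset().name());
    }
}
```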
First, the character encoding used is not directly related to the locale. So changing the locale won't help much.
Second, the � is typical for the Unicode replacement character U+FFFD (�)
being printed in ISO-8859-1 instead of UTF-8. Here's the evidence:
System.out.println(new String("\uFFFD".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½
So there are two problems:
1. The platform default encoding is not UTF-8.
2. The console is not displaying output as UTF-8.
For a Sun JVM, the VM argument -Dfile.encoding=UTF-8
should fix the first problem. The second problem is to be fixed in the console settings. If you're using for example Eclipse, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.
Update: As per your update:
byte[] textArray = f.getName().getBytes();
That should have been the following to exclude influence of platform default encoding:
byte[] textArray = f.getName().getBytes("UTF-8");
If that still displays the same, then the problem lies deeper. Which JVM exactly are you using? Run java -version
. As said before, the -Dfile.encoding
argument is Sun JVM specific. Some Linux machines ship with the GNU JVM or OpenJDK's JVM, and this argument may not work there.
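To see why the explicit charset matters, compare the bytes the same string produces under two encodings (the file name "résumé.txt" is just an example):

```java
import java.nio.charset.StandardCharsets;

public class BytesDemo {
    public static void main(String[] args) {
        String name = "résumé.txt"; // 10 characters, two of them non-ASCII
        // With an explicit charset, the result no longer depends on the
        // platform default encoding.
        byte[] utf8   = name.getBytes(StandardCharsets.UTF_8);      // é -> 2 bytes each
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1); // é -> 1 byte each
        System.out.println("UTF-8:      " + utf8.length + " bytes");   // 12
        System.out.println("ISO-8859-1: " + latin1.length + " bytes"); // 10
    }
}
```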
It's a bug in the old-school java.io File API, maybe just on a Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters that failed to load using the java.io classes. After converting all my code to use java.nio.Path, everything started working. And I replaced Apache FileUtils (which has the same problem) with java.nio.Files.
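As a sketch of that migration (file name and content are hypothetical), the java.nio.file.Files equivalents of File.exists() and FileUtils.readLines() look like this:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class NioDemo {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get("données.txt"); // hypothetical name with a non-ASCII character
        Files.write(p, Arrays.asList("bonjour"), StandardCharsets.UTF_8);
        System.out.println(Files.exists(p));                                 // replaces File.exists()
        List<String> lines = Files.readAllLines(p, StandardCharsets.UTF_8);  // replaces FileUtils.readLines()
        System.out.println(lines);
        Files.delete(p);
    }
}
```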
Well, I was struggling with this issue all day! My previous (wrong) code was the same as yours:
for (File f : dir.listFiles()) {
    String filename = f.getName(); // The filename here is wrong!
    FileInputStream fis = new FileInputStream(filename);
}
and it did not work. (I'm using Oracle Java 1.7 on CentOS 6; LANG and LC_CTYPE are fr_FR.UTF-8 for all users except zimbra, which has LANG and LC_CTYPE=C. That is almost certainly the cause of this issue, but I can't change it without the risk that Zimbra stops working...)
So I decided to use the new classes of java.nio.file package (Files and Paths):
try (DirectoryStream<Path> paths = Files.newDirectoryStream(Paths.get(outputName))) {
    for (Path path : paths) {
        String filename = path.getFileName().toString(); // The filename here is correct
        ...
    }
}
So if you are using Java 1.7, you should give the new classes in the java.nio.file package a try: they saved my day!
Hope it helps
It is a bug in the JRE/JDK which has existed for years.
How to fix Java when it refuses to open a file with a special character in the filename?
File.exists() fails with unicode characters in name
I am now submitting a new bug report to them, as LC_ALL=en_US fixes some cases but fails in others.