How can I open files containing accents in Java?

傲寒 2020-12-01 18:44

(editing for clarification and adding some code)

Hello, We have a requirement to parse data sent from users all over the world. Our Linux systems have a de

6 Answers
  • 2020-12-01 19:05

    When you use DirectoryStream, don't forget to close the stream (try-with-resources can help here).
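
    A minimal sketch of what closing the stream with try-with-resources could look like. The class name ListDirectory and the helper listNames are made up for illustration; only Files.newDirectoryStream and DirectoryStream come from the JDK.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ListDirectory {

    // Returns the file names found in dir (hypothetical helper).
    // try-with-resources closes the DirectoryStream even if iteration throws.
    public static List<String> listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                names.add(entry.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption for the demo: list the current directory when no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (String name : listNames(dir)) {
            System.out.println(name);
        }
    }
}
```

    Without the try block, the directory handle stays open until the stream is garbage collected, which can exhaust file descriptors in a long-running process.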

  • 2020-12-01 19:15

    The Java system property file.encoding should match the console's character encoding. The property must be set when starting java on the command-line:

    java -Dfile.encoding=UTF-8 …
    

    Normally this happens automatically, because the console encoding is usually the platform default encoding, and Java will use the platform default encoding if you don't specify one explicitly.
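
    To check what the JVM actually picked up, a small sketch like the one below may help. The class name EncodingInfo is hypothetical. The sun.jnu.encoding property is implementation-specific: as far as I know, OpenJDK-based JVMs use it for file names, but it may be absent elsewhere.

```java
import java.nio.charset.Charset;

public class EncodingInfo {

    // Builds a short report of the encodings this JVM is using.
    public static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("default charset  = ").append(Charset.defaultCharset().name()).append('\n');
        sb.append("file.encoding    = ").append(System.getProperty("file.encoding")).append('\n');
        // Implementation-specific property; prints "null" on JVMs that do not define it.
        sb.append("sun.jnu.encoding = ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

    Running it once under the user's normal locale and once with LC_ALL=C in the environment should show whether the locale changes what Java reports.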

  • 2020-12-01 19:20

    First, the character encoding used is not directly related to the locale. So changing the locale won't help much.

    Second, the sequence ï¿½ is typical of the Unicode replacement character U+FFFD (�) being printed as ISO-8859-1 instead of UTF-8. Here's the evidence:

    System.out.println(new String("\uFFFD".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½
    

    So there are two problems:

    1. Your JVM is reading those special characters as � (U+FFFD).
    2. Your console is using ISO-8859-1 to display characters.

    For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem is to be fixed in the console settings. If you're using for example Eclipse, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.
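
    The first problem can be reproduced in isolation. The sketch below assumes the JVM decodes file names with an ASCII-only charset, as it may do under LANG=C; that assumption is mine, not something the question confirms. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecode {

    // Decodes UTF-8 bytes with the wrong charset (US-ASCII), then encodes the result again.
    public static byte[] roundTripThroughAscii(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Bytes that are not valid ASCII are replaced by U+FFFD while decoding.
        String misread = new String(utf8, StandardCharsets.US_ASCII);
        // U+FFFD cannot be encoded in ASCII, so it becomes '?' on the way back.
        return misread.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "cafe.txt" with an acute accent on the e, escaped so the source encoding does not matter.
        String name = "caf\u00e9.txt";
        byte[] before = name.getBytes(StandardCharsets.UTF_8);
        byte[] after = roundTripThroughAscii(name);
        System.out.println(Arrays.equals(before, after)); // false
    }
}
```

    Once the original bytes are lost like this, the name Java holds no longer matches the name on disk, which is why opening the file fails.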


    Update: As per your update:

    byte[] textArray = f.getName().getBytes();
    

    That should have been the following to exclude influence of platform default encoding:

    byte[] textArray = f.getName().getBytes("UTF-8");
    

    If that still displays the same, then the problem lies deeper. Exactly which JVM are you using? Run java -version. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ship with the GNU JVM or OpenJDK's JVM, and this argument may then not work.
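
    To see whether the name is already damaged before any console is involved, it can help to dump the bytes as hex. This is only a sketch; NameBytes and toHex are hypothetical names.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

public class NameBytes {

    // Formats bytes as space-separated lowercase hex, for example "c3 a9".
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption for the demo: inspect the current directory when no argument is given.
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File f : files) {
            // An explicit charset keeps the platform default out of the picture.
            System.out.println(toHex(f.getName().getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

    If the output contains ef bf bd where an accented character should be, the String already holds U+FFFD and the damage happened when Java read the directory, not when the name was printed.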

  • 2020-12-01 19:21

    It's a bug in the old-school java.io.File API, maybe only on a Mac? Anyway, the new java.nio API works much better. I had several files with Unicode characters in their names that failed to load using the java.io... classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. I also replaced Apache FileUtils (which has the same problem) with java.nio.file.Files...
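
    As a rough idea of what such a conversion might look like, here is a sketch that reads a file through java.nio.file only. NioRead and readLines are hypothetical names, and UTF-8 content is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class NioRead {

    // Reads all lines of a file via java.nio.file, decoding the content as UTF-8 (assumed).
    public static List<String> readLines(Path path) throws IOException {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java NioRead <file>");
            return;
        }
        for (String line : readLines(Paths.get(args[0]))) {
            System.out.println(line);
        }
    }
}
```

    Calling toPath() on a java.io.File that already carries a damaged name will probably not repair it, so the Path is best obtained from the java.nio.file API in the first place.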

  • 2020-12-01 19:23

    Well, I struggled with this issue all day! My previous (wrong) code was the same as yours:

    for(File f : dir.listFiles()) {
     String filename = f.getName(); // The filename here is wrong !
     FileInputStream fis = new FileInputStream (filename);
    }
    

    and it did not work. I'm using Oracle Java 1.7 on CentOS 6, with LANG and LC_CTYPE=fr_FR.UTF-8 for all users except zimbra, which has LANG and LC_CTYPE=C. That is almost certainly the cause of this issue, but I can't change it without the risk that Zimbra stops working.

    So I decided to use the new classes of java.nio.file package (Files and Paths):

    // try-with-resources closes the directory stream when the loop finishes
    try (DirectoryStream<Path> paths = Files.newDirectoryStream(Paths.get(outputName))) {
      for (Path path : paths) {
        String filename = path.getFileName().toString(); // The filename here is correct
        ...
      }
    }
    

    So if you are using Java 1.7, you should give the new classes in the java.nio.file package a try: they saved my day!

    Hope it helps

  • 2020-12-01 19:26

    It is a bug in the JRE/JDK that has existed for years.

    How to fix java when if refused to open a file with special charater in filename?

    File.exists() fails with unicode characters in name

    I am now submitting a new bug report to them, because LC_ALL=en_us fixes some cases but breaks others.
