File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

故里飘歌 2020-11-30 02:24

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related metho

6 Answers
  • 2020-11-30 02:49

    I've seen something similar before. People who uploaded files from their Mac to a webapp used filenames containing é.

    a) In Mac OS that char is a normal e followed by a combining mark (the "sign for ´ applied to the previous char").

    b) In Windows it's a single precomposed char: é

    Both are valid Unicode. So, as I understand it, you pass option (b) when creating the File and at some point Mac OS converts it to option (a). If you search for this double-representation issue you may find a way to handle both cases successfully.
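
    To make the two representations concrete, here is a minimal sketch. The class name AccentForms and the helper sameText are made up for illustration; java.text.Normalizer is the standard Java 6+ API.

```java
import java.text.Normalizer;

public class AccentForms {
    // Precomposed form: a single code point, U+00E9.
    static final String COMPOSED = "\u00E9";
    // Decomposed form: plain 'e' followed by U+0301 (combining acute accent).
    static final String DECOMPOSED = "e\u0301";

    // Hypothetical helper: compares two strings after bringing both to NFC.
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED));    // false
        System.out.println(COMPOSED.length() + " vs " + DECOMPOSED.length()); // 1 vs 2
        System.out.println(sameText(COMPOSED, DECOMPOSED)); // true
    }
}
```

    Both strings render as é, yet a plain equals() treats them as different until they are normalized to the same form.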

    Hope it helps!

  • 2020-11-30 02:55

    Solution extracted from question:

    Thanks to Stephen P for putting me on the right track.

    The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.

    // Requires: import java.text.Normalizer;
    // Normalize to "Normalization Form Canonical Decomposition" (NFD)
    protected String normalizeUnicode(String str) {
        Normalizer.Form form = Normalizer.Form.NFD;
        if (!Normalizer.isNormalized(str, form)) {
            return Normalizer.normalize(str, form);
        }
        return str;
    }
    

    Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation and something like this reflection-based hack. See also "How does this normalize function work?"

    This alone is enough for me to decide I won't support compilation of my project with Java 5 :|

    Here are other interesting things I learned in this sordid adventure.

    • The confusion is caused by the file names being in one of two normalization forms that cannot be compared directly: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have ASCII letters followed by combining "modifiers" that add accents etc., while the latter uses single precomposed characters with no leading ASCII character. Read the wiki page Stephen P references for a better explanation.

    • Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:

      String name = "Trîcky Nåme";
      System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
      System.out.println("NFC Normalized name: " + URLEncoder.encode(
          Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
      System.out.println("NFD Normalized name: " + URLEncoder.encode(
          Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));
      

      Output:

      Original name: Tri%CC%82cky+Na%CC%8Ame
      NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
      NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame
      
    • If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha. It certainly got me.

    • According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:

      When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode.

      So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
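
    Putting the points above together, one way to look up a file regardless of which form the directory listing returns is to normalize both sides before comparing. This is only a sketch: the class NormalizedLookup and its methods nfc and findByName are hypothetical names, and NFC is an arbitrary choice of common form (NFD would work equally well as long as both sides use it).

```java
import java.io.File;
import java.text.Normalizer;

public class NormalizedLookup {
    // Brings a name to NFC so composed and decomposed spellings compare equal.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Returns the first entry of dir whose name matches wanted after
    // normalization, or null if there is no such entry.
    static File findByName(File dir, String wanted) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return null; // not a directory, or an I/O error occurred
        }
        String target = nfc(wanted);
        for (File entry : entries) {
            if (nfc(entry.getName()).equals(target)) {
                return entry;
            }
        }
        return null;
    }
}
```

    Calling NormalizedLookup.findByName(dir, name) should then find the entry whether name arrived composed or decomposed, without relying on what listFiles() happens to return on a given platform.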

  • 2020-11-30 02:57

    I suspect that you just have to tell javac which encoding to use when compiling the .java file that contains the special characters, since you've hardcoded them in the source file. Otherwise the platform default encoding will be used, which may not be UTF-8 at all.

    You can use the javac option -encoding for this (it is a compiler option, not a VM argument).

    javac -encoding UTF-8 com/example/Foo.java

    This way the resulting .class file will end up containing the correct characters and you will be able to create and list the correct filename as well.

  • 2020-11-30 02:59

    On a Unix file system, a file name really is a null-terminated byte[]. So the Java runtime has to convert from java.lang.String to byte[] during the createNewFile() operation. The char-to-byte conversion is governed by the locale. I tested with LC_ALL set to en_US.UTF-8 and en_US.ISO-8859-1 and got coherent results. This is with Sun (now Oracle) Java 1.6.0_20. However, for LC_ALL=en_US.POSIX, the result is:

    File name:   Tr%C3%AEcky+N%C3%A5me
    Listed name: Tr%3Fcky+N%3Fme
    

    3F is a question mark. It tells me that the conversion was not successful for the non-ASCII characters. So even here, everything behaves as expected.
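
    The same effect can be reproduced without touching the file system. The sketch below assumes the POSIX locale behaves like an ASCII-only charset, so US-ASCII is used as a stand-in; LossyEncodingDemo and roundTrip are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

public class LossyEncodingDemo {
    // Hypothetical helper: encodes a name to bytes in the given charset and decodes
    // it again. Unmappable characters are replaced by the charset's replacement byte.
    static String roundTrip(String name, Charset charset) {
        return new String(name.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String name = "Tr\u00EEcky N\u00E5me"; // composed (NFC) spelling
        Charset utf8 = Charset.forName("UTF-8");
        Charset ascii = Charset.forName("US-ASCII"); // assumed stand-in for the POSIX locale
        System.out.println("File name:   " + URLEncoder.encode(roundTrip(name, utf8), "UTF-8"));
        System.out.println("Listed name: " + URLEncoder.encode(roundTrip(name, ascii), "UTF-8"));
    }
}
```

    With the ASCII charset each accented letter comes back as ?, which URLEncoder renders as %3F, matching the listed name shown above.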

    But the reason your two strings differ is the canonical equivalence between the \u00EE character (C3 AE in UTF-8) and the sequence i + \u0302 (69 CC 82 in UTF-8). \u0302 is a combining diacritical mark (combining circumflex accent). Some sort of normalization occurred during file creation. I'm not sure whether it's done in the Java runtime or the OS.

    NOTE: It took me some time to figure this out, since the code snippet that you posted does not contain a combining diacritical mark but the equivalent precomposed character î (i.e. \u00EE). You should have embedded the Unicode escape sequence in the string literal (but that's easy to say afterward...).
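
    Since the two forms look identical on screen, a small helper that prints non-ASCII characters as escapes makes it obvious which one a string actually contains. This is an illustrative sketch; EscapeDump and toEscapes are made-up names.

```java
public class EscapeDump {
    // Hypothetical helper: keeps ASCII characters as they are and renders every
    // other UTF-16 code unit as a backslash-u escape with four hex digits.
    static String toEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String composed = "Tr\u00EEcky";    // precomposed i with circumflex
        String decomposed = "Tri\u0302cky"; // plain i plus combining circumflex
        System.out.println(toEscapes(composed));
        System.out.println(toEscapes(decomposed));
    }
}
```

    The first line prints Tr\u00EEcky and the second Tri\u0302cky, so the difference that is invisible in a terminal or editor shows up immediately.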

  • 2020-11-30 03:13

    An alternative solution is to use the newer java.nio.file.Path API (Java 7 and later) in place of the java.io.File API, which works perfectly.
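
    A minimal sketch of listing a directory with the NIO API follows. NioListing and listNames are hypothetical names, and the NFC normalization step is a defensive assumption added here, not something this answer says is required.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class NioListing {
    // Lists the entry names of dir, each normalized to NFC so that callers
    // can compare them against names that were normalized the same way.
    static List<String> listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<String>();
        // try-with-resources closes the directory stream even on failure.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                String raw = entry.getFileName().toString();
                names.add(Normalizer.normalize(raw, Normalizer.Form.NFC));
            }
        }
        return names;
    }
}
```

    Unlike File.listFiles(), which returns null on failure, Files.newDirectoryStream reports problems through an IOException.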

  • 2020-11-30 03:15

    Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

    You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

    i 0xCC 0x82 vs. 0xC3 0xAE
    a 0xCC 0x8A vs. 0xC3 0xA5
    

    That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of the Unicode \u0302 "combining circumflex accent" character, while the second is the UTF-8 encoding of \u00EE "latin small letter i with circumflex". Similarly for the other pair, the first is the letter a followed by 0xCC 0x8A, the "combining ring above" character, and the second is "latin small letter a with ring above". Both are valid UTF-8 encodings of valid Unicode character strings, but the first is in "decomposed" and the second in "composed" form.
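
    The byte sequences above can be checked directly from Java. This is a small sketch; Utf8Bytes and utf8Hex are names chosen for the example.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

public class Utf8Bytes {
    // Hypothetical helper: returns the UTF-8 bytes of s as space-separated hex pairs.
    static String utf8Hex(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName("UTF-8"))) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00EE"; // latin small letter i with circumflex
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(utf8Hex(composed));   // C3 AE
        System.out.println(utf8Hex(decomposed)); // 69 CC 82
    }
}
```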

    OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On a Unix file system, names are stored however the filesystem driver chooses to store them, so you can't make any blanket statements across different types of filesystems.

    See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

    See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

    A recent email thread on Apple's java-dev mailing list could be of some help to you.

    Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
