Java Can't Open a File with Surrogate Unicode Values in the Filename?

一整个雨季 2020-12-03 17:54

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and

4 Answers
  • 2020-12-03 18:18

    I suspect that either Java or the Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (a slight variation of CESU-8 that also encodes U+0000 as two bytes) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem encoding or defaultCharset. Unfortunately I have neither a Mac nor Java here to test with.

    “Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

    \xF0\xA6\xBF\xB6
    

    it outputs a UTF-8-encoded sequence for each of the surrogates:

    \xED\xA1\x9B\xED\xBF\xB6
    

    This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. The problem is that if you round-trip it through a real UTF-8 encoder, you get a different byte string: the four-byte one above. Try to access the file with that name and boom! It fails.
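
The difference is easy to reproduce. What follows is a minimal Python 3 sketch, not Java's actual encoder: `cesu8_encode` is a hypothetical helper that mimics CESU-8 by encoding each UTF-16 code unit on its own. It ignores the extra U+0000 rule of Java's modified UTF-8.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit as its own UTF-8 sequence."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' lets Python emit the (invalid) encoding of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00026FF6"            # the supplementary character used above
proper = ch.encode("utf-8")  # real UTF-8: one four-byte sequence
broken = cesu8_encode(ch)    # CESU-8 style: two three-byte sequences

print(proper)            # b'\xf0\xa6\xbf\xb6'
print(broken)            # b'\xed\xa1\x9b\xed\xbf\xb6'
print(proper == broken)  # False
```

For BMP-only text the two encodings agree, which is presumably why this kind of problem only shows up with supplementary characters.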

    So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

    $ python
    Python 2.x.something (blah blah)
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.listdir('.')
    

    On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

    ['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']
    

    which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

    ['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']
    

    it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:

    os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
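
In Python 3 the same check and rename can be done on bytes filenames by passing a bytes path to `os.listdir`. This is a rough sketch, assuming the bad names really are CESU-8 style. `cesu8_to_utf8` and `rename_cesu8_names` are names made up for illustration, and none of this has been tried on OS X.

```python
import os


def cesu8_to_utf8(name: bytes) -> bytes:
    """Re-encode a CESU-8-style byte string as proper UTF-8 (hypothetical helper)."""
    # Decode leniently so UTF-8-encoded surrogates survive as lone surrogates...
    text = name.decode("utf-8", "surrogatepass")
    # ...then pass through UTF-16 so each surrogate pair fuses into one code point.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


def rename_cesu8_names(directory: bytes = b".") -> list:
    """Rename entries whose name changes under cesu8_to_utf8; return (old, new) pairs."""
    renamed = []
    for name in os.listdir(directory):  # bytes path in, bytes names out
        try:
            fixed = cesu8_to_utf8(name)
        except UnicodeError:
            continue  # not UTF-8-like, or an unpaired surrogate: leave it alone
        if fixed != name:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, fixed))
            renamed.append((name, fixed))
    return renamed
```

Names that are already proper UTF-8 come back unchanged from `cesu8_to_utf8`, so only the six-byte-per-character variants get renamed. Try it on a copy of the directory first.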
    
  • 2020-12-03 18:24

    If your environment's default locale does not include those characters, you cannot open the file.

    See: File.exists() fails with unicode characters in name

    Edit: All right, what you need to do is change the system locale, whatever OS you are using.

    Edit:

    See: How can I open files containing accents in Java?

    See: JFileChooser on Mac cannot see files named by Chinese chars?

  • 2020-12-03 18:37

    This turned out to be a problem with the Mac JVM (tested on 1.5 and 1.6). Filenames containing supplementary characters / surrogate pairs cannot be accessed with the Java File class. I ended up writing a JNI library with Carbon calls for the Mac version of the project (ick). I suspect the CESU-8 issue bobince mentioned, as the JNI call to get UTF-8 characters returned a CESU-8 string. Doesn't look like it's something you can really get around.

  • 2020-12-03 18:43

    It's a bug in the old-school java.io.File API, maybe only on a Mac? Anyway, the new java.nio API works much better. I have several files with Unicode characters in their names and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. I also replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.file.Files...

    ...and be sure to read and write the content of the file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
