Can't read UTF-8 filenames when launched as an Upstart service

忘了有多久 2021-02-04 06:39

My Java program reads the contents of a directory recursively. This is a sample tree (note the non-ASCII characters):

./sviluppo
./sviluppo/ciaò
./sviluppo/ciaò/         


        
2 Answers
  •  温柔的废话
    2021-02-04 07:21

    Java uses a native call to list the contents of a directory. The underlying C runtime relies on the current locale to build Java Strings from the raw bytes that the filesystem stores as the filename.

    When you execute a Java program from a shell (as either a privileged or an unprivileged user), it inherits the shell's environment variables. The LANG variable is read to choose the charset used to decode the filename bytes into a Java String, and by default on Ubuntu it is set to a UTF-8 locale.
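
    To make the role of the charset concrete, here is a minimal sketch that decodes the UTF-8 bytes of ciaò with three different charsets. It only illustrates the decoding mismatch and is not the JVM's actual internal code path. The names DecodeDemo and decode are hypothetical, chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decode raw filename bytes with a given charset,
    // roughly what a runtime does with the charset implied by the locale.
    public static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // "ciaò" as stored on a UTF-8 filesystem: 63 69 61 C3 B2
        byte[] rawName = "cia\u00F2".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the two bytes C3 B2 become the single character ò
        System.out.println(decode(rawName, StandardCharsets.UTF_8));
        // Latin-1: each of the two bytes becomes its own (wrong) character
        System.out.println(decode(rawName, StandardCharsets.ISO_8859_1));
        // ASCII: bytes above 0x7F cannot be decoded and are replaced
        System.out.println(decode(rawName, StandardCharsets.US_ASCII));
    }
}
```

    Only the first call recovers the original name. What the terminal actually shows also depends on the output encoding, so compare the strings in code rather than by eye.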

    Note that a process need not be run from a shell, but judging from the code, Upstart is smart enough to recognize when the command in the configuration file is meant to be executed by a shell. So, assuming that the JVM is invoked through a shell, the problem is that the LANG variable is not set, so the C runtime falls back to a default charset, which happens not to be UTF-8. The solution is to set it in the Upstart stanza:

    description "List UTF-8 encoded filenames"
    author "Raffaele Sgarro"
    env LANG=en_US.UTF-8
    script
      cd /workspace
      java -jar list.jar test > log.txt
    end script
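
    To check whether the env line took effect, one option is to run a small diagnostic from the same job. The sketch below prints the settings that typically influence filename decoding. LocaleCheck and settings are hypothetical names, and sun.jnu.encoding is an implementation-specific property of OpenJDK-based JVMs, so treat its presence as an assumption.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class LocaleCheck {
    // Collect the environment variables and JVM properties related to encoding.
    // Missing values are reported as the string "null".
    public static Map<String, String> settings() {
        Map<String, String> s = new LinkedHashMap<>();
        s.put("LANG", String.valueOf(System.getenv("LANG")));
        s.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        s.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        // Assumption: this property exists on OpenJDK-based JVMs only
        s.put("sun.jnu.encoding", String.valueOf(System.getProperty("sun.jnu.encoding")));
        s.put("defaultCharset", Charset.defaultCharset().name());
        return s;
    }

    public static void main(String[] args) {
        settings().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

    Running it once from an interactive shell and once from the Upstart job, then comparing the two outputs, should show whether LANG reaches the JVM.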
    

    I used en_US.UTF-8 as the locale, but any UTF-8 backed one will do just as well. Here is the source of the test list.jar:

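    import java.io.File;

    public class ListNames { // class name is illustrative; the original shows only main
        public static void main(String[] args) {
            // Print the name of each entry in the directory passed as args[0]
            for (File file : new File(args[0]).listFiles()) {
                System.out.println(file.getName());
            }
        }
    }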
    

    The directory /workspace/test contains filenames like ààà, èèè and so on. Now you can move to the database part ;)
