Why does usage of java.nio.file.Files::list cause this breadth-first file traversal program to crash with the “Too many open files” error?

Posted by 为君一笑 on 2020-01-02 07:05:56

Question


Assumption:

Streams are lazy, hence the following statement does not load all of the directory's children into memory at once; instead, it loads them one by one. After each invocation of forEach, the directory referenced by p becomes eligible for garbage collection, so its file descriptor should also be closed:

Files.list(path).forEach(p -> 
   absoluteFileNameQueue.add(
      p.toAbsolutePath().toString()
   )
);

Based on this assumption, I have implemented a breadth-first file traversal tool:

import static java.lang.Math.max;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Queue;

public class FileSystemTraverser {

    public void traverse(String path) throws IOException {
        traverse(Paths.get(path));
    }

    public void traverse(Path root) throws IOException {
        final Queue<String> absoluteFileNameQueue = new ArrayDeque<>();
        absoluteFileNameQueue.add(root.toAbsolutePath().toString());

        int maxSize = 0;
        int count = 0;

        while (!absoluteFileNameQueue.isEmpty()) {
            maxSize = max(maxSize, absoluteFileNameQueue.size());
            count += 1;
            Path path = Paths.get(absoluteFileNameQueue.poll());

            if (Files.isDirectory(path)) {
                Files.list(path).forEach(p ->
                        absoluteFileNameQueue.add(
                                p.toAbsolutePath().toString()
                        )
                );
            }

            if (count % 10_000 == 0) {
                System.out.println("maxSize = " + maxSize);
                System.out.println("count = " + count);
            }
        }

        System.out.println("maxSize = " + maxSize);
        System.out.println("count = " + count);
    }

}

And I use it in a fairly straightforward way:

public class App {

    public static void main(String[] args) throws IOException {
        FileSystemTraverser traverser = new FileSystemTraverser();
        traverser.traverse("/media/Backup");
    }

}

The disk mounted in /media/Backup has about 3 million files.

For some reason, around the 140,000 mark, the program crashes with this stack trace:

Exception in thread "main" java.nio.file.FileSystemException: /media/Backup/Disk Images/Library/Containers/com.apple.photos.VideoConversionService/Data/Documents: Too many open files
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427)
    at java.nio.file.Files.newDirectoryStream(Files.java:457)
    at java.nio.file.Files.list(Files.java:3451)

It seems to me that, for some reason, the file descriptors are not being closed or the Path objects are not being garbage collected, which eventually causes the app to crash.

System Details

  • OS: Ubuntu 15.04
  • Kernel: 4.4.0-28-generic
  • ulimit: unlimited
  • File System: btrfs
  • Java runtime: tested with both OpenJDK 1.8.0_91 and Oracle JDK 1.8.0_91

Any ideas what I am missing here and how I can fix this problem (without resorting to java.io.File::list, i.e. by staying within the realm of NIO.2 and Paths)?


Update 1:

I doubt that the JVM is keeping the file descriptors open. I took this heap dump around the 120,000-file mark:

[heap dump screenshot]

Update 2:

I installed a file descriptor probing plugin in VisualVM, and indeed it revealed that the FDs are not getting disposed of (as correctly pointed out by cerebrotecnologico and k5):

[VisualVM file descriptor screenshot]


Answer 1:


It seems the Stream returned from Files.list(Path) is not being closed correctly. In addition, you should not use forEach on a stream unless you are certain it is not parallel (hence the .sequential()).

    try (Stream<Path> stream = Files.list(path)) {
        stream.map(p -> p.toAbsolutePath().toString())
              .sequential()
              .forEach(absoluteFileNameQueue::add);
    }
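Folded back into the question's queue-based traverser, the fix might look like the sketch below (the return value is added here purely so the behavior is observable; the question's version printed counters instead):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.stream.Stream;

public class ClosingTraverser {

    // Breadth-first traversal; returns the number of paths visited.
    public int traverse(Path root) throws IOException {
        Queue<String> queue = new ArrayDeque<>();
        queue.add(root.toAbsolutePath().toString());

        int count = 0;
        while (!queue.isEmpty()) {
            Path path = Paths.get(queue.poll());
            count++;

            if (Files.isDirectory(path)) {
                // try-with-resources guarantees the underlying
                // DirectoryStream (and its file descriptor) is closed
                // before the next iteration, even if an exception occurs.
                try (Stream<Path> children = Files.list(path)) {
                    children.map(p -> p.toAbsolutePath().toString())
                            .forEach(queue::add);
                }
            }
        }
        return count;
    }
}
```

Because each stream is closed before the next directory is opened, at most one listing's file descriptor is open at any time, regardless of how many directories are traversed.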



Answer 2:


From the Java documentation:

"The returned stream encapsulates a DirectoryStream. If timely disposal of file system resources is required, the try-with-resources construct should be used to ensure that the stream's close method is invoked after the stream operations are completed"




Answer 3:


The other answers give you the solution. I just want to correct the misapprehension in your question that is the root cause of your problem:

... the directory referenced by p is eligible for garbage collection, so its file descriptor should also become closed.

This assumption is incorrect.

Yes, the directory (actually DirectoryStream) will be eligible for garbage collection. However, that does not mean that it will be garbage collected. The GC runs when the Java runtime system determines that it would be a good time to run it. Generally speaking, it takes no account of the number of open file descriptors that your application has created.

In other words, you should NOT rely on garbage collection and finalization to close resources. If you need a resource to be closed in a timely fashion, then your application should take care of this for itself. The "try-with-resources" construct is the recommended way to do it.
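The same principle applies at the lower level: Files.newDirectoryStream returns a DirectoryStream that holds an open file descriptor, and try-with-resources releases it deterministically rather than at some future GC cycle. A minimal sketch (the helper name is illustrative, not from the question):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DirListing {

    // Returns the absolute paths of the directory's entries,
    // closing the DirectoryStream (and its FD) before returning.
    static List<String> listChildren(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                names.add(p.toAbsolutePath().toString());
            }
        } // stream.close() runs here, whether or not an exception occurred
        return names;
    }
}
```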


You commented:

I actually thought that because nothing references the Path objects and that their FDs are also closed, then the GC will remove them from the heap.

A Path object doesn't have a file descriptor. And if you look at the API, there isn't a Path.close() operation either.

The file descriptors that are being leaked in your example are actually associated with the DirectoryStream objects created by list(path). These objects only become eligible for garbage collection when the Stream.forEach() call completes.

My misunderstanding was that the FD of the Path objects are closed after each forEach invocation.

Well, that doesn't make sense; see above.

But even if it did make sense (i.e. if Path objects did have file descriptors), there is no mechanism for the GC to know that it needs to do something with the Path objects at that point.

Otherwise I know that the GC does not immediately remove eligible objects from the memory (hence the term "eligible").

That really >>is<< the root of the problem ... because the eligible file descriptor objects will >>only<< be finalized when the GC runs.



Source: https://stackoverflow.com/questions/38279074/why-does-usage-of-java-nio-files-filelist-is-causing-this-breadth-first-file-t
