How to retrieve a list of directories QUICKLY in Java?

轮回少年 2020-12-01 12:45

Suppose a very simple program that lists out all the subdirectories of a given directory. Sound simple enough? Except the only way to list all subdirectories in Java is to u

14 answers
  • 2020-12-01 12:52

    If your OS is "stable", give JNA a try:

    • opendir/readdir on UNIX
    • FindFirstFile and related API on Windows
    • Java 7 with NIO.2

    These are all "streaming" APIs. They don't force you to allocate a 150k-entry list/array before you start searching. IMHO this is a great advantage in your scenario.
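    For the NIO.2 option, a minimal sketch might look like the following. The class and method names (SubdirLister, listSubdirectories) are made up for illustration. Note that Files.isDirectory may still cost one attribute lookup per entry, so this only avoids the up-front array, not the per-entry check.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
public class SubdirLister {

    /** Returns the immediate subdirectories of dir, reading entries lazily. */
    public static List<Path> listSubdirectories(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        // The filter is applied entry by entry while the stream is consumed,
        // so no array holding every entry is allocated first.
        DirectoryStream.Filter<Path> onlyDirs = entry -> Files.isDirectory(entry);
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, onlyDirs)) {
            for (Path entry : stream) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to scan is passed as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listSubdirectories(dir)) {
            System.out.println(p);
        }
    }
}
```

    The try-with-resources block matters here: a DirectoryStream holds an open directory handle until it is closed.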

  • 2020-12-01 12:56

    Here's an off-the-wall solution, devoid of any testing at all. It depends on having a filesystem that supports symbolic links, and it isn't a Java solution. I suspect your problem is filesystem/OS-related rather than Java-related.

    Is it possible to create a parallel directory structure, with subdirectories based on the initial letters of the filenames, and then symbolically link to the real files? As an illustration,

    /symlinks/a/b/cde
    

    would link to

    /realfiles/abcde
    

    (where /realfiles is where your 150,000 files reside)

    You'd have to create and maintain this directory structure, and I don't have enough info to determine if that's practical. But the above would create a fast(er) index into your non-hierarchical (and slow) directory.
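    If you did want to maintain that index from Java, a rough sketch could look like this. Everything here (the SymlinkIndex class, the indexPath and link methods, and the choice of two letter levels) is an assumption for illustration, and it is just as untested against a real 150,000-file directory as the idea itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that mirrors the /symlinks/a/b/cde layout described above.
public class SymlinkIndex {

    /**
     * Maps a file name to its place in the index, using one directory level
     * per leading character: ("abcde", 2 levels) -> indexRoot/a/b/cde.
     */
    public static Path indexPath(Path indexRoot, String name, int levels) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        // Keep at least one character for the leaf name.
        int depth = Math.max(0, Math.min(levels, name.length() - 1));
        Path p = indexRoot;
        for (int i = 0; i < depth; i++) {
            p = p.resolve(String.valueOf(name.charAt(i)));
        }
        return p.resolve(name.substring(depth));
    }

    /** Creates the symbolic link for one real file; needs OS support and permissions. */
    public static Path link(Path indexRoot, Path realFile, int levels) throws IOException {
        Path link = indexPath(indexRoot, realFile.getFileName().toString(), levels);
        Files.createDirectories(link.getParent());
        return Files.createSymbolicLink(link, realFile.toAbsolutePath());
    }
}
```

    Files.createSymbolicLink throws UnsupportedOperationException on filesystems without symlink support, and on Windows it may need extra privileges.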

  • 2020-12-01 12:58

    I came across a similar question when debugging performance in a Java application that enumerates a lot of files. It uses the old approach:

    for (File f : new File("C:\\").listFiles()) {
        if (f.isDirectory()) {
            continue;
        }        
    }
    

    It appears that each f.isDirectory() call goes into the native FileSystem, which is very slow, at least on NTFS. Java 7 NIO adds new APIs, but not all of its methods are fast. Here are the JMH benchmark results:

    Benchmark                  Mode  Cnt  Score    Error  Units
    MyBenchmark.dir_listFiles  avgt    5  0.437 ±  0.064   s/op
    MyBenchmark.path_find      avgt    5  0.046 ±  0.001   s/op
    MyBenchmark.path_walkTree  avgt    5  1.702 ±  0.047   s/op
    

    The numbers come from running this code:

    java -jar target/benchmarks.jar -bm avgt -f 1 -wi 5 -i 5 -t 1
    
    static final String testDir = "C:/Sdk/Ide/NetBeans/src/dev/src/";
    static final int nCycles = 50;
    
    public static class Counter {
        int countOfFiles;
        int countOfFolders;
    }
    
    @Benchmark
    public List<File> dir_listFiles() {
        List<File> files = new ArrayList<>(1000);
    
        for( int i = 0; i < nCycles; i++ ) {
            File dir = new File(testDir);
    
            files.clear();
            for (File f : dir.listFiles()) {
                if (f.isDirectory()) {
                    continue;
                }
                files.add(f);
            }
        }
        return files;
    }
    
    @Benchmark
    public List<Path> path_walkTree() throws Exception {
        final List<Path> files = new ArrayList<>(1000);
    
        for( int i = 0; i < nCycles; i++ ) {
            Path dir = Paths.get(testDir);
    
            files.clear();
            Files.walkFileTree(dir, new SimpleFileVisitor<Path> () {
                @Override
                public FileVisitResult visitFile(Path path, BasicFileAttributes arg1) throws IOException {
                    files.add(path);
                    return FileVisitResult.CONTINUE;
                }
    
                @Override
                public FileVisitResult preVisitDirectory(Path path, BasicFileAttributes arg1) 
                        throws IOException {
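                    // Compare with equals() rather than reference identity; only descend into the start directory.
                    return path.equals(dir) ? FileVisitResult.CONTINUE : FileVisitResult.SKIP_SUBTREE;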
                }
            });
        }
    
        return files;
    }
    
    @Benchmark
    public List<Path> path_find() throws Exception {
        final List<Path> files = new ArrayList<>(1000);
    
        for( int i = 0; i < nCycles; i++ ) {
            Path dir = Paths.get(testDir);
    
            files.clear();
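            // Files.find returns a stream backed by open directory handles, so close it explicitly.
            try (java.util.stream.Stream<Path> found = Files.find(dir, 1, (path, attrs)
                    -> true /*!attrs.isDirectory()*/)) {
                files.addAll(found.collect(Collectors.toList()));
            }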
        }
    
        return files;
    }
    
  • 2020-12-01 13:01

    There is also a recursive parallel scanning approach described at http://blogs.oracle.com/adventures/entry/fast_directory_scanning. Essentially, sibling directories are processed in parallel. The post also includes encouraging performance tests.
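    To give an idea of what "siblings in parallel" can mean, here is a small fork/join sketch. It is not the code from the linked post; the ParallelDirScan class is my own illustration and simply forks one task per subdirectory on the common pool.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task: collects every directory below a root, scanning siblings concurrently.
public class ParallelDirScan extends RecursiveTask<List<Path>> {
    private final Path dir;

    ParallelDirScan(Path dir) {
        this.dir = dir;
    }

    @Override
    protected List<Path> compute() {
        List<Path> found = new ArrayList<>();
        List<ParallelDirScan> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                // NOFOLLOW_LINKS avoids looping through symlinked directories.
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    found.add(entry);
                    ParallelDirScan child = new ParallelDirScan(entry);
                    child.fork();          // schedule the sibling scan asynchronously
                    children.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (ParallelDirScan child : children) {
            found.addAll(child.join());    // wait for each subtree and merge its result
        }
        return found;
    }

    /** Scans the whole tree under root using the common fork/join pool. */
    public static List<Path> scan(Path root) {
        return ForkJoinPool.commonPool().invoke(new ParallelDirScan(root));
    }
}
```

    This only pays off when there are many subdirectories to spread across threads; a single flat directory with 150,000 entries is still read by one task.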

  • 2020-12-01 13:06

    There's actually a reason why you got the lectures: it's the correct answer to your problem. Here's the background, so that perhaps you can make some changes in your live environment.

    First: directories are stored on the filesystem; think of them as files, because that's exactly what they are. When you iterate through the directory, you have to read those blocks from the disk. Each directory entry will require enough space to hold the filename, and permissions, and information on where that file is found on-disk.

    Second: directories aren't stored with any internal ordering (at least, not in the filesystems where I've worked with directory files). If you have 150,000 entries and 2 sub-directories, those 2 sub-directory references could be anywhere within the 150,000. You have to iterate to find them, there's no way around that.

    So, let's say that you can't avoid the big directory. Your only real option is to try to keep the blocks comprising the directory file in the in-memory cache, so that you're not hitting the disk every time you access them. You can achieve this by regularly iterating over the directory in a background thread -- but this is going to cause undue load on your disks, and interfere with other processes. Alternatively, you can scan once and keep track of the results.

    The other option is to create a tiered directory structure. If you look at commercial websites, you'll see URLs like /1/150/15023.html -- this is meant to keep the number of files per directory small. Think of it as a B-tree index in a database.

    Of course, you can hide that structure: you can create a filesystem abstraction layer that takes filenames and automatically generates the directory tree where those filenames can be found.
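    A minimal sketch of such a layer, assuming the tiers are simply prefixes of the file name as in the /1/150/15023.html example. The TieredStore class, the locate method, and the prefix lengths 1 and 3 are all hypothetical choices; pick whatever split keeps your directories small.

```java
import java.nio.file.Path;

// Hypothetical abstraction layer: callers pass plain file names,
// the layer decides which nested directory each one lives in.
public class TieredStore {
    // Assumed tiering: first 1 character, then first 3 characters of the name.
    private static final int[] PREFIX_LENGTHS = {1, 3};

    /** Maps "15023.html" to root/1/150/15023.html. */
    public static Path locate(Path root, String fileName) {
        if (fileName.isEmpty()) {
            throw new IllegalArgumentException("fileName must not be empty");
        }
        // Tier on the name without its extension.
        int dot = fileName.lastIndexOf('.');
        String stem = dot > 0 ? fileName.substring(0, dot) : fileName;
        Path p = root;
        for (int len : PREFIX_LENGTHS) {
            // Short names just reuse the whole stem for the deeper tiers.
            p = p.resolve(stem.substring(0, Math.min(len, stem.length())));
        }
        return p.resolve(fileName);
    }
}
```

    Code that reads or writes a file would call locate first (and Files.createDirectories on the parent before writing), so the rest of the application never sees the tiers.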

  • 2020-12-01 13:07

    As of 2020, DirectoryStream does seem to be faster than calling File.listFiles() and checking each entry with isDirectory().

    I learned the answer from here:

    https://www.baeldung.com/java-list-directory-files

    I'm using Java 1.8 on Windows 10.
