Is there a workaround for Java's poor performance on walking huge directories?

予麋鹿 2020-12-02 23:30

I am trying to process files one at a time that are stored over a network. Reading the files is fast thanks to buffering, so that is not the issue. The problem I have is just listing

10 Answers
  • 2020-12-03 00:10

    Although it's not pretty, I solved this kind of problem once by piping the output of dir/ls to a file before starting my app, and passing in the filename.

    If you needed to do it within the app, you could just use Runtime.getRuntime().exec(), but it would create some nastiness.

    You asked. The first form is going to be blazingly fast, the second should be pretty fast as well.

    Be sure to use your chosen command's options for one item per line (bare, no decoration, no graphics), full paths, and recursion.
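
    A minimal sketch of the first form, assuming the listing was written beforehand with something like `dir /s /b > files.txt` on Windows or `find /data -type f > files.txt` on Linux (the file name and /data are placeholders). ListingReader and forEachListedFile are made-up names for illustration, and the sketch assumes the listing is in the platform's default charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

// Hypothetical helper: reads a pre-generated listing with one path per line.
class ListingReader {

    /** Calls handler once per non-blank line and returns how many lines were handled. */
    static long forEachListedFile(Path listing, Consumer<String> handler) throws IOException {
        long count = 0;
        // Assumption: the listing uses the platform default charset.
        // Lines are read one at a time, so the full listing is never held in memory.
        try (BufferedReader in = Files.newBufferedReader(listing, Charset.defaultCharset())) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines that some commands emit
                }
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the listing file passed in when the app is started.
        long handled = forEachListedFile(Paths.get(args[0]),
                name -> System.out.println("processing file " + name));
        System.out.println(handled + " entries processed");
    }
}
```

    Run it as `java ListingReader files.txt`; processing starts with the first line instead of waiting for a full directory scan inside the JVM.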

    EDIT:

    30 minutes just to get a directory listing, wow.

    It just struck me that if you use exec(), you can get its stdout redirected into a pipe instead of writing it to a file.

    If you did that, you should start getting the files immediately and be able to begin processing before the command has completed.

    The interaction may actually slow things down, but maybe not--you might give it a try.

    Wow, I just went to find the syntax of the .exec command for you and came across this, possibly exactly what you want (it lists a directory using exec and "ls" and pipes the result into your program for processing): good link in wayback (Jörg provided it in a comment to replace the one from Sun that Oracle broke)

    Anyway, the idea is straightforward but getting the code right is annoying. I'll go steal some codes from the internets and hack them up--brb

    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    
    /**
     * Note: Only use this as a last resort!  It's specific to Windows and even
     * at that it's not a good solution, but it should be fast.
     * 
     * To use it, extend FileProcessor and call processFiles("...") with a list
     * of options if you want them, like /s... I highly recommend /b
     * 
     * Override processFile and it will be called once for each line of output.
     */
    public abstract class FileProcessor
    {
       public void processFiles(String dirOptions)
       {
          Process theProcess;
    
          // run "dir" through the Windows command interpreter
          try
          {
              theProcess = Runtime.getRuntime().exec("cmd /c dir " + dirOptions);
          }
          catch(IOException e)
          {
             System.err.println("Error on exec() method");
             e.printStackTrace();
             return; // nothing to read if the process never started
          }
    
          // read the called program's standard output stream, one line at a time
          try (BufferedReader inStream = new BufferedReader(
                                    new InputStreamReader( theProcess.getInputStream() )))
          {
             String line;
             while ((line = inStream.readLine()) != null)
             {
                processFile(line);
             }
          }
          catch(IOException e)
          {
             System.err.println("Error on inStream.readLine()");
             e.printStackTrace();
          }
    
       } // end method
    
       /** Override this method--it will be called once for each file */
       public abstract void processFile(String filename);
    
    } // end class
    

    And thank you code donor at IBM

  • 2020-12-03 00:10

    How about using the File.list(FilenameFilter filter) method and implementing FilenameFilter.accept(File dir, String name) so that it processes each file and returns false?

    I ran this on a Linux VM for a directory with 10K+ files and it took <10 seconds.

    import java.io.File;  
    import java.io.FilenameFilter;
    
    public class Temp {
        private static void processFile(File dir, String name) {
            File file = new File(dir, name);
            System.out.println("processing file " + file.getName());
        }
    
        private static void forEachFile(File dir) {
            String [] ignore = dir.list(new FilenameFilter() {
                public boolean accept(File dir, String name) {
                    processFile(dir, name);
                    return false;
                }
            });
        }
    
        public static void main(String[] args) {
            long before, after;
            File dot = new File(".");
            before = System.currentTimeMillis();
            forEachFile(dot);
            after = System.currentTimeMillis();
            System.out.println("after call, delta is " + (after - before));
        }  
    }
    
  • 2020-12-03 00:10

    A non-portable solution would be to make native calls to the operating system and stream the results.

    For Linux

    You can look at something like readdir. You can walk the directory structure like a linked list and return results in batches or individually.

    For Windows

    On Windows the approach would be fairly similar, using the FindFirstFile and FindNextFile APIs.
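
    For comparison, since Java 7 the JDK offers the same streaming style without hand-written native code: Files.newDirectoryStream hands back entries as you iterate instead of building the whole array first. This is not part of the answer above, only a minimal sketch; StreamingLister and forEachEntry are made-up names, and whether it helps on a network share will depend on the file system and protocol.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

// Hypothetical helper: visits directory entries one by one as they are read.
class StreamingLister {

    /** Calls handler for each entry of dir (non-recursive) and returns the entry count. */
    static long forEachEntry(Path dir, Consumer<Path> handler) throws IOException {
        long count = 0;
        // The stream must be closed, hence try-with-resources.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                handler.accept(entry); // processing can start before the listing is complete
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        long seen = forEachEntry(dir, entry -> System.out.println("processing file " + entry));
        System.out.println(seen + " entries");
    }
}
```

    The iteration order is unspecified, and subdirectories show up as plain entries, so recursion would have to be added by the caller.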

  • 2020-12-03 00:13

    I wonder why there are 10k files in a directory. Some file systems do not work well with so many files. File systems have specific limits, such as the maximum number of files per directory and the maximum depth of nested subdirectories.

    I solved a similar problem with an iterator.

    I needed to walk huge directories, several levels deep in the directory tree, recursively.

    I tried FileUtils.iterateFiles() from Apache Commons IO, but it implements the iterator by adding all the files to a List and then returning List.iterator(). That is very bad for memory.

    So I prefer to write something like this:

    private static class SequentialIterator implements Iterator<File> {
        private DirectoryStack dir = null;
        private File current = null;
        private long limit;
        private FileFilter filter = null;
    
        public SequentialIterator(String path, long limit, FileFilter ff) {
            current = new File(path);
            this.limit = limit;
            filter = ff;
            dir = DirectoryStack.getNewStack(current);
        }
    
        public boolean hasNext() {
            while(walkOver());
            return isMore && (limit > count || limit < 0) && dir.getCurrent() != null;
        }
    
        private long count = 0;
    
        public File next() {
            File aux = dir.getCurrent();
            dir.advancePostition();
            count++;
            return aux;
        }
    
        private boolean walkOver() {
            if (dir.isOutOfDirListRange()) {
                if (dir.isCantGoParent()) {
                    isMore = false;
                    return false;
                } else {
                    dir.goToParent();
                    dir.advancePostition();
                    return true;
                }
            } else {
                if (dir.isCurrentDirectory()) {
                    if (dir.isDirectoryEmpty()) {
                        dir.advancePostition();
                    } else {
                        dir.goIntoDir();
                    }
                    return true;
                } else {
                    if (filter.accept(dir.getCurrent())) {
                        return false;
                    } else {
                        dir.advancePostition();
                        return true;
                    }
                }
            }
        }
    
        private boolean isMore = true;
    
        public void remove() {
            throw new UnsupportedOperationException();
        }
    
    }
    

    Note that the iterator stops after a given number of files (the limit; a negative limit means no limit) and that it also takes a FileFilter.

    And DirectoryStack is:

    import java.io.File;
    import java.util.Set;
    import java.util.Stack;
    import java.util.TreeSet;
    
    public class DirectoryStack {
        private class Element{
            private File files[] = null;
            private int currentPointer;
            public Element(File current) {
                currentPointer = 0;
                if (current.exists()) {
                    if(current.isDirectory()){
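                        files = current.listFiles();
                        if (files == null) { // listFiles() returns null on an I/O error
                            throw new IllegalArgumentException("Cannot list directory " + current);
                        }
                        // the TreeSet below sorts the entries so the walk order is stable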
                        Set<File> set = new TreeSet<File>();
                        for (int i = 0; i < files.length; i++) {
                            File file = files[i];
                            set.add(file);
                        }
                        set.toArray(files);
                    }else{
                        throw new IllegalArgumentException("File current must be directory");
                    }
                } else {
                    throw new IllegalArgumentException("File current not exist");
                }
    
            }
            public String toString(){
                return "current="+getCurrent().toString();
            }
            public int getCurrentPointer() {
                return currentPointer;
            }
            public void setCurrentPointer(int currentPointer) {
                this.currentPointer = currentPointer;
            }
            public File[] getFiles() {
                return files;
            }
            public File getCurrent(){
                File ret = null;
                try{
                    ret = getFiles()[getCurrentPointer()];
                }catch (Exception e){
                }
                return ret;
            }
            public boolean isDirectoryEmpty(){
                return !(getFiles().length>0);
            }
            public Element advancePointer(){
                setCurrentPointer(getCurrentPointer()+1);
                return this;
            }
        }
        private DirectoryStack(File first){
            getStack().push(new Element(first));
        }
        public static DirectoryStack getNewStack(File first){
            return new DirectoryStack(first);
        }
        public String toString(){
            String ret = "stack:\n";
            int i = 0;
            for (Element elem : stack) {
                ret += "level " + i++ + elem.toString()+"\n";
            }
            return ret;
        }
        private Stack<Element> stack=null;
        private Stack<Element> getStack(){
            if(stack==null){
                stack = new Stack<Element>();
            }
            return stack;
        }
        public File getCurrent(){
            return getStack().peek().getCurrent();
        }
        public boolean isDirectoryEmpty(){
            return getStack().peek().isDirectoryEmpty();
        }
        public DirectoryStack downLevel(){
            getStack().pop();
            return this;
        }
        public DirectoryStack goToParent(){
            return downLevel();
        }
        public DirectoryStack goIntoDir(){
            return upLevel();
        }
        public DirectoryStack upLevel(){
            if(isCurrentNotNull())
                getStack().push(new Element(getCurrent()));
            return this;
        }
        public DirectoryStack advancePostition(){
            getStack().peek().advancePointer();
            return this;
        }
        public File[] peekDirectory(){
            return getStack().peek().getFiles();
        }
        public boolean isLastFileOfDirectory(){
            return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
        }
        public boolean gotMoreLevels() {
            return getStack().size()>0;
        }
        public boolean gotMoreInCurrentLevel() {
            return getStack().peek().getFiles().length > getStack().peek().getCurrentPointer()+1;
        }
        public boolean isRoot() {
            return !(getStack().size()>1);
        }
        public boolean isCurrentNotNull() {
            if(!getStack().isEmpty()){
                int currentPointer = getStack().peek().getCurrentPointer();
                int maxFiles = getStack().peek().getFiles().length;
                return currentPointer < maxFiles;
            }else{
                return false;
            }
        }
        public boolean isCurrentDirectory() {
            return getStack().peek().getCurrent().isDirectory();
        }
        public boolean isLastFromDirList() {
            return getStack().peek().getCurrentPointer() == (getStack().peek().getFiles().length-1);
        }
        public boolean isCantGoParent() {
            return !(getStack().size()>1);
        }
        public boolean isOutOfDirListRange() {
            return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
        }
    
    }
    
  • 2020-12-03 00:15

    I doubt the problem is related to the bug report you referenced. The issue there is "only" memory usage, not necessarily speed. If you have enough memory, the bug is not relevant to your problem.

    You should measure whether your problem is memory related or not. Turn on your Garbage Collector log and use for example gcviewer to analyze your memory usage.
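
    Before digging into GC logs, a rough first number can be taken from inside the program. This is a minimal sketch, assuming a plain File.list() call is the suspect; ListingProbe and probe are made-up names, and the heap figure from Runtime is only approximate because a garbage collection can run in between.

```java
import java.io.File;

// Hypothetical probe: times one File.list() call and estimates heap growth.
class ListingProbe {

    /** Returns { entry count (-1 if listing failed), elapsed millis, approx. heap growth in bytes }. */
    static long[] probe(File dir) {
        Runtime rt = Runtime.getRuntime();
        long usedBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();

        String[] names = dir.list(); // null if dir is not a directory or an I/O error occurs

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long usedAfter = rt.totalMemory() - rt.freeMemory();
        long count = (names == null) ? -1 : names.length;
        return new long[] { count, elapsedMs, usedAfter - usedBefore };
    }

    public static void main(String[] args) {
        long[] result = probe(new File(args.length > 0 ? args[0] : "."));
        System.out.println("entries=" + result[0]
                + " elapsedMs=" + result[1]
                + " approxHeapGrowthBytes=" + result[2]);
    }
}
```

    If the elapsed time is large while the heap growth stays small, memory is probably not the bottleneck and the network path is the more likely culprit.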

    I suspect the SMB protocol is causing the problem. You can try writing a test in another language to see if it's faster, or you can try to get the list of filenames through some other method, such as the one described here in another post.

  • 2020-12-03 00:21

    If you eventually need to process all the files, then having an Iterable instead of a String[] won't give you any advantage, as you'll still have to fetch the whole list of files.
