I have to deal with a directory of about 2 million XML files to be processed.
I've already solved the processing itself by distributing the work between machines and threads.
If Java 7 is not an option, this hack will work (for UNIX):
// Spawn "ls -f" and stream its output, so the JVM never holds the full listing in memory
Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    // -f implies -a, so skip ".", ".." and other dot entries
    if (line.startsWith("."))
        continue;
    System.out.println(line);
}
reader.close();
The -f parameter will speed it up (from man ls):
    -f     do not sort, enable -aU, disable -lst
Please post the full stack trace of the OOM exception to identify where the bottleneck is, as well as a short, complete Java program showing the behaviour you see.
It is most likely because you collect all of the two million entries in memory, and they don't fit. Can you increase heap space?
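If raising the heap is an option, it is set with the JVM's -Xmx switch when the job is launched, for example (the 4g value and the class name are only placeholders):
java -Xmx4g YourProcessingJob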
First of all, do you have any possibility of using Java 7? There you have a FileVisitor and Files.walkFileTree, which should probably work within your memory constraints.
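A minimal sketch of that approach, assuming the files live under /path and that handleFile/handleException are your own hooks rather than part of the API:
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

try {
    Files.walkFileTree(Paths.get("/path"), new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            handleFile(file.toFile()); // one file at a time, nothing is accumulated
            return FileVisitResult.CONTINUE;
        }
    });
} catch (IOException e) {
    handleException(e);
}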
Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way and perhaps puts them in a producer/consumer queue or writes the file names to disk for later traversal.
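Roughly like this (processFile is just a placeholder for whatever hands the file to your workers or queue):
File dir = new File("/path");
dir.listFiles(new FileFilter() {
    @Override
    public boolean accept(File file) {
        processFile(file); // handle or enqueue the file immediately
        return false;      // never accepted, so no huge File[] is built up
    }
});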
Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 to file0001000, then file0001000 to file0002000, and so on.
If the names do not follow a nice pattern like this, you could try filtering based on the hash code of the file name, which is supposed to be fairly evenly distributed over the set of integers.
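For example, something along these lines (reusing dir from the sketch above; the bucket count of 100 is arbitrary and processFile is again a placeholder):
final int buckets = 100;
for (int b = 0; b < buckets; b++) {
    final int current = b;
    dir.listFiles(new FilenameFilter() {
        @Override
        public boolean accept(File parent, String name) {
            // only handle names whose hash falls into the current bucket
            if ((name.hashCode() & 0x7fffffff) % buckets != current)
                return false;
            processFile(new File(parent, name));
            return false; // still never collected into the result array
        }
    });
}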
Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:
public File[] listFiles(FilenameFilter filter) {
String ss[] = list();
if (ss == null) return null;
ArrayList v = new ArrayList();
for (int i = 0 ; i < ss.length ; i++) {
if ((filter == null) || filter.accept(this, ss[i])) {
v.add(new File(ss[i], this));
}
}
return (File[])(v.toArray(new File[v.size()]));
}
so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.
Btw, could you give an example of a file name? Are they "guessable"? Like
for (int i = 0; i < 100000; i++)
tryToOpen(String.format("file%05d", i))
This also requires Java 7, but it's simpler than the Files.walkFileTree answer if you just want to list the contents of a directory and not walk the whole tree:
Path dir = Paths.get("/some/directory");
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
for (Path path : stream) {
handleFile(path.toFile());
}
} catch (IOException e) {
handleException(e);
}
The implementation of DirectoryStream
is platform-specific and never calls File.list
or anything like it, instead using the Unix or Windows system calls that iterate over a directory one entry at a time.
Since you're on Windows, it seems like you should have simply used ProcessBuilder to start something like "cmd /k dir /b target_directory", capture the output of that, and route it into a file. You can then process that file a line at a time, reading the file names out and dealing with them.
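A sketch of that idea (ProcessBuilder.redirectOutput needs Java 7; cmd is invoked with /c here so it exits once the listing is written; the paths, listing.txt and handleFile are placeholders, and exception handling is left out):
ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir", "/b", "C:\\target_directory");
pb.redirectOutput(new File("listing.txt")); // route the listing straight to a file
Process p = pb.start();
p.waitFor();

BufferedReader in = new BufferedReader(new FileReader("listing.txt"));
String name;
while ((name = in.readLine()) != null) {
    handleFile(new File("C:\\target_directory", name)); // one name at a time
}
in.close();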
Better late than never? ;)
Try this; it works for me, but I didn't have that many documents...
File dir = new File("directory");
String[] children = dir.list();
if (children == null) {
    // Either dir does not exist or is not a directory
    System.out.print("Directory doesn't exist\n");
} else {
    for (int i = 0; i < children.length; i++) {
        // Get filename of file or directory
        String filename = children[i];
        // process filename here
    }
}