I have to deal with a directory of about 2 million XML files to be processed.
I've already solved the processing itself by distributing the work between machines and threads.
If Java 7 is not an option, this hack will work (for UNIX):
// Spawn "ls -f" and stream its output, so the JVM never holds the full listing in memory
Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    // -f implies -a, so skip ".", ".." and other dot entries
    if (line.startsWith("."))
        continue;
    System.out.println(line);
}
reader.close();
The -f parameter will speed it up (from man ls):
    -f     do not sort, enable -aU, disable -lst
Please post the full stack trace of the OOM exception to identify where the bottleneck is, as well as a short, complete Java program showing the behaviour you see.
It is most likely because you collect all of the two million entries in memory, and they don't fit. Can you increase heap space?
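If raising the heap is an option, it is set with the JVM's -Xmx switch when the job is launched, for example (the 4g value and the class name are only placeholders):
java -Xmx4g YourProcessingJob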
First of all, do you have any possibility of using Java 7? There you have a FileVisitor and Files.walkFileTree, which should probably work within your memory constraints.
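A minimal sketch of that approach, assuming the files live under /path and that handleFile/handleException are your own hooks rather than part of the API:
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

try {
    Files.walkFileTree(Paths.get("/path"), new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            handleFile(file.toFile()); // one file at a time, nothing is accumulated
            return FileVisitResult.CONTINUE;
        }
    });
} catch (IOException e) {
    handleException(e);
}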
Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way and perhaps puts them in a producer/consumer queue or writes the file names to disk for later traversal.
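Roughly like this (processFile is just a placeholder for whatever hands the file to your workers or queue):
File dir = new File("/path");
dir.listFiles(new FileFilter() {
    @Override
    public boolean accept(File file) {
        processFile(file); // handle or enqueue the file immediately
        return false;      // never accepted, so no huge File[] is built up
    }
});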
Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 to file0001000, then file0001000 to file0002000, and so on.
If the names do not follow a nice pattern like this, you could try filtering based on the hash code of the file name, which is supposed to be fairly evenly distributed over the set of integers.
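For example, something along these lines (reusing dir from the sketch above; the bucket count of 100 is arbitrary and processFile is again a placeholder):
final int buckets = 100;
for (int b = 0; b < buckets; b++) {
    final int current = b;
    dir.listFiles(new FilenameFilter() {
        @Override
        public boolean accept(File parent, String name) {
            // only handle names whose hash falls into the current bucket
            if ((name.hashCode() & 0x7fffffff) % buckets != current)
                return false;
            processFile(new File(parent, name));
            return false; // still never collected into the result array
        }
    });
}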
Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:
public File[] listFiles(FilenameFilter filter) {
String ss[] = list();
if (ss == null) return null;
ArrayList v = new ArrayList();
for (int i = 0 ; i < ss.length ; i++) {
if ((filter == null) || filter.accept(this, ss[i])) {
v.add(new File(ss[i], this));
}
}
return (File[])(v.toArray(new File[v.size()]));
}
so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.
Btw, could you give an example of a file name? Are they "guessable"? Like
for (int i = 0; i < 100000; i++)
tryToOpen(String.format("file%05d", i))
This also requires Java 7, but it's simpler than the Files.walkFileTree answer if you just want to list the contents of a directory and not walk the whole tree:
Path dir = Paths.get("/some/directory");
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
for (Path path : stream) {
handleFile(path.toFile());
}
} catch (IOException e) {
handleException(e);
}
The implementation of DirectoryStream
is platform-specific and never calls File.list
or anything like it, instead using the Unix or Windows system calls that iterate over a directory one entry at a time.
Since you're on Windows, it seems like you should have simply used ProcessBuilder to start something like "cmd /k dir /b target_directory", capture the output of that, and route it into a file. You can then process that file a line at a time, reading the file names out and dealing with them.
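A sketch of that idea (ProcessBuilder.redirectOutput needs Java 7; cmd is invoked with /c here so it exits once the listing is written; the paths, listing.txt and handleFile are placeholders, and exception handling is left out):
ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir", "/b", "C:\\target_directory");
pb.redirectOutput(new File("listing.txt")); // route the listing straight to a file
Process p = pb.start();
p.waitFor();

BufferedReader in = new BufferedReader(new FileReader("listing.txt"));
String name;
while ((name = in.readLine()) != null) {
    handleFile(new File("C:\\target_directory", name)); // one name at a time
}
in.close();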
Better late than never? ;)
Try this; it works for me, but I didn't have that many documents...
File dir = new File("directory");
String[] children = dir.list();
if (children == null) {
    // Either dir does not exist or is not a directory
    System.out.print("Directory doesn't exist\n");
} else {
    for (int i = 0; i < children.length; i++) {
        // Get filename of file or directory
        String filename = children[i];
        // process filename here
    }
}