I have to deal with a directory of about 2 million XML files to be processed.
I've already solved the processing by distributing the work between machines and threads us
If the file names follow certain rules, you can use File.list(filter) instead of File.listFiles() to get manageable portions of the file listing.
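For example, if the names share known prefixes, something like this could walk the directory in slices (a minimal sketch; the directory path and the prefix rule are assumptions):

import java.io.File;
import java.io.FilenameFilter;

public class ListPortion {
    public static void main(String[] args) {
        File dir = new File("/path/to/xml-dir");
        // Take only the files whose names start with a given prefix.
        String[] portion = dir.list(new FilenameFilter() {
            @Override
            public boolean accept(File d, String name) {
                return name.startsWith("000") && name.endsWith(".xml");
            }
        });
        // Repeat with other prefixes ("001", "002", ...) to cover the whole directory in portions.
    }
}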
I faced the same problem when I developed a malware scanning application. My solution was to execute a shell command to list all the files, which is faster than recursively browsing folder by folder.
See more about the ls shell command here: http://adbshell.com/commands/adb-shell-ls
Process process = Runtime.getRuntime().exec("ls -R /");
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while ((line = bufferedReader.readLine()) != null) {
    // Each line of the "ls -R" output is a directory header ("dir:"), a blank line, or a file name.
}
You could use listFiles with a custom FilenameFilter. The first time the FilenameFilter is passed to listFiles, it accepts the first 1000 files and records them as visited.
The next time the FilenameFilter is passed to listFiles, it ignores those 1000 visited files and returns the next 1000, and so on until the listing is complete.
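A minimal sketch of such a stateful filter, assuming a batch size of 1000; the class and method names (BatchFilter, startNewPass) are placeholders, and note that the visited set itself grows to hold every name seen so far:

import java.io.File;
import java.io.FilenameFilter;
import java.util.HashSet;
import java.util.Set;

class BatchFilter implements FilenameFilter {
    private final Set<String> visited = new HashSet<>();
    private final int batchSize;
    private int acceptedThisPass;

    BatchFilter(int batchSize) {
        this.batchSize = batchSize;
    }

    // Call before each listFiles() pass to start a fresh batch.
    void startNewPass() {
        acceptedThisPass = 0;
    }

    @Override
    public boolean accept(File dir, String name) {
        if (visited.contains(name) || acceptedThisPass >= batchSize) {
            return false;
        }
        visited.add(name);
        acceptedThisPass++;
        return true;
    }
}

Usage would be repeated calls such as filter.startNewPass(); File[] batch = dir.listFiles(filter); until the returned array comes back empty.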
As a first approach you might try tweaking some JVM memory settings, e.g. increasing the heap size as was suggested, or even using the AggressiveHeap option. Given the large number of files, this may not help, in which case I would suggest working around the problem: create several files with filenames in each, say 500k filenames per file, and read from them.
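A minimal sketch of that workaround, assuming the filename lists have already been generated; the index file name and the directory path below are placeholders:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class ReadFromIndex {
    public static void main(String[] args) throws IOException {
        // Read one pre-generated list of filenames instead of listing the huge directory itself.
        try (BufferedReader reader = new BufferedReader(new FileReader("filenames-000.txt"))) {
            String name;
            while ((name = reader.readLine()) != null) {
                File xml = new File("/path/to/xml-dir", name); // hand this off to the existing processing code
            }
        }
    }
}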
Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.
Then, construct File objects as needed when processing the result.
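For example (a minimal sketch; the directory path is a placeholder):

import java.io.File;

public class ListNames {
    public static void main(String[] args) {
        File dir = new File("/path/to/xml-dir");
        String[] names = dir.list();            // plain names only: no File objects, no full paths
        if (names != null) {
            for (String name : names) {
                File xml = new File(dir, name); // build the File only when it is actually needed
                // process xml here
            }
        }
    }
}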
However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.
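If you do reorganize, a rough one-time sketch of such a move (the two-character bucket scheme and the path are assumptions, and it still requires one full listing up front):

import java.io.File;

public class BucketFiles {
    public static void main(String[] args) {
        File dir = new File("/path/to/xml-dir");
        String[] names = dir.list();            // one full listing is still needed for the move
        if (names == null) return;
        for (String name : names) {
            File source = new File(dir, name);
            if (!source.isFile()) continue;     // skip sub-directories, including buckets created below
            // Bucket by the first two characters of the name, e.g. "ab123.xml" -> "ab/ab123.xml".
            String prefix = name.length() >= 2 ? name.substring(0, 2) : name;
            File bucket = new File(dir, prefix);
            bucket.mkdirs();
            source.renameTo(new File(bucket, name));
        }
    }
}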
At first you could try to increase the memory available to your JVM by passing -Xmx1024m, for example.
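For example, when launching the application (the jar name is a placeholder):

java -Xmx1024m -jar xml-processor.jar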