Does it matter how many files I keep in a single directory? If so, how many files in a directory is too many, and what are the impacts of having too many files? (This is on
I have had over 8 million files in a single ext3 directory. libc readdir()
which is used by find
, ls
and most of the other methods discussed in this thread to list large directories.
The reason ls
and find
are slow in this case is that readdir()
only reads 32K of directory entries at a time, so on slow disks it will require many many reads to list a directory. There is a solution to this speed problem. I wrote a pretty detailed article about it at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/
The key take away is: use getdents()
directly -- http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html rather than anything that's based on libc readdir()
so you can specify the buffer size when reading directory entries from disk.
Not an answer, but just some suggestions.
Select a more suitable FS (file system). Since from a historic point of view, all your issues were wise enough, to be once central to FSs evolving over decades. I mean more modern FS better support your issues. First make a comparison decision table based on your ultimate purpose from FS list.
I think its time to shift your paradigms. So I personally suggest using a distributed system aware FS, which means no limits at all regarding size, number of files and etc. Otherwise you will sooner or later challenged by new unanticipated problems.
I'm not sure to work, but if you don't mention some experimentation, give AUFS over your current file system a try. I guess it has facilities to mimic multiple folders as a single virtual folder.
To overcome hardware limits you can use RAID-0.
There is no single figure that is "too many", as long as it doesn't exceed the limits of the OS. However, the more files in a directory, regardless of the OS, the longer it takes to access any individual file, and on most OS's, the performance is non-linear, so to find one file out of 10,000 takes more then 10 times longer then to find a file in 1,000.
Secondary problems associated with having a lot of files in a directory include wild card expansion failures. To reduce the risks, you might consider ordering your directories by date of upload, or some other useful piece of metadata.