Question
I am writing a program that should process many small files, say thousands or even millions. I've been testing that part on 500k files, and the first step was just to iterate a directory which has around 45k directories in it (including subdirs of subdirs, etc.) and 500k small files. Traversing all directories and files, including getting file sizes and calculating the total size, takes about 6 seconds. Now, if I try to open each file while traversing and close it immediately, it looks like it never stops. In fact, it takes way too long (hours...). Since I do this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I didn't read or write anything, although in the final implementation I'll only need to read. However, I didn't see a noticeable improvement with any of these attempts.
I wonder if there's a more efficient way to open the files with any of the available functions, whether it's C, C++ or the Windows API. Or is the only more efficient way to read the MFT and read blocks of the disk directly, which I am trying to avoid?
Update: The application that I am working on is doing backup snapshots with versioning. So, it also has incremental backups. The test with 500k files is done on a huge source code repository in order to do versioning, something like a scm. So, all files are not in one directory. There are around 45k directories as well (mentioned above).
So, the proposed solution to zip the files doesn't help, because when the backup is done, that's when all files are accessed. Hence, I'll see no benefit from that, and it'll even incur some performance cost.
Answer 1:
What you are trying to do is intrinsically difficult for any operating system to do efficiently. 45,000 subdirectories require a lot of disk access no matter how it is sliced.
Any file over about 1,000 bytes is "big" as far as NTFS is concerned. If there were a way to make most data files less than about 900 bytes, you could realize a major efficiency by having the file data stored inside the MFT. Then it would be no more expensive to obtain the data than it is to obtain the file's timestamps or size.
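To get a feel for how much of a tree could benefit from this, here is a rough sketch (in Python for brevity) that counts the files at or below a size cutoff. The `RESIDENT_LIMIT` value and the function name are my own assumptions for illustration; the real cutoff depends on the MFT record size and on how much of each record the file's other attributes consume, so treat the result as an estimate only.

```python
import os

# Assumed cutoff, taken from the "about 900 bytes" figure above. The real
# limit varies with the MFT record size and the file's other attributes.
RESIDENT_LIMIT = 900

def count_small_files(root, limit=RESIDENT_LIMIT):
    """Return (files_at_or_below_limit, total_files) for the tree under root."""
    small = 0
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    total += 1
                    # Only metadata is queried here; the file is never opened.
                    if entry.stat(follow_symlinks=False).st_size <= limit:
                        small += 1
    return small, total
```

Calling `count_small_files(r"C:\some\repo")` (a hypothetical path) returns a pair such as `(small, total)`, which tells you whether shrinking files below the cutoff is even a realistic option for your data.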
I doubt there is any way to optimize the program's parameters, process options, or even the operating system's tuning parameters to make the application work well. You are faced with multi-hour operation unless you can rearchitect it in a radically different way.
One strategy would be to distribute the files across multiple computers, probably thousands of them, and have a sub-application on each one process the local files and feed its results to a master application.
Another strategy would be to re-architect all the files into a few larger files, like big .zip files as suggested by @felicepollano, effectively virtualizing your set of files. Random access to a 4000 GB file is inherently a far more efficient and effective use of resources than accessing 4 million 1 MB files. Moving all the data into a suitable database manager (MySQL, SQL Server, etc.) would also accomplish this and perhaps provide other benefits, like easy searches and an easy archival strategy.
Answer 2:
An overhead of 5 to 20ms per file isn't abnormal for an NTFS volume with that number of files. (On a conventional spindled drive, you can't expect much better than that anyway, because it's on the same order as the head seek times. From this point on, I'll assume we're dealing with enterprise-class hardware, SSD and/or RAID.)
Based on my experience, you can significantly increase throughput by parallelizing the requests, i.e., using multiple threads and/or processes. Most of the overhead appears to be per-thread: the system can open ten files at once nearly as quickly as it can open a single file by itself. I'm not sure why this is. You might need to experiment to find the optimum level of parallelization.
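A minimal sketch of the parallel approach, written in Python for brevity; the same structure would apply to a pool of native threads calling CreateFileW. The function names and the default of 8 workers are assumptions for illustration, not measured optima, and the list of paths is expected to have been gathered beforehand.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def open_and_close(path):
    """Open one file read-only and close it at once; return True on success."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return False
    os.close(fd)
    return True

def open_all(paths, workers=8):
    """Open every path using a pool of worker threads; return the success count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() spreads the paths over the workers; True counts as 1 in sum().
        return sum(pool.map(open_and_close, paths))
```

Timing `open_all(paths, workers=n)` for a few values of `n` is one way to run the experiment suggested above and find where the gains level off on your hardware.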
The system administrator can also significantly improve performance by copying the contents to a new volume, preferably in approximately the same order that they will be accessed. I had to do this recently, and it reduced backup time (for a volume with about 14 million files) from 85 hours to 18 hours.
You might also try OpenFileById() which may perform better for files in large directories, since it bypasses the need to enumerate the directory tree. However, I've never tried it myself, and it might not have much impact since the directory is likely to be cached anyway if you've just enumerated it.
You can also enumerate the files on the disk more quickly by reading them from the MFT, although it sounds as if that isn't a bottleneck for you at the moment.
Answer 3:
There is a hack you can try: zip these files with a low compression level and then use a zip library to read them. This is usually much faster than reading the individual files one by one. Of course, this should be done in advance as a preprocessing step.
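As a rough illustration of the idea, the sketch below packs a tree into one archive at the lowest deflate level and then reads every member back through a single open file. It uses Python's standard `zipfile` module; the function names are hypothetical, and the archive is assumed to live outside the tree being packed so that it does not try to include itself.

```python
import os
import zipfile

def pack_tree(root, archive_path):
    """Pack every file under root into one zip archive; return the file count."""
    count = 0
    # compresslevel=1 is the fastest deflate setting (Python 3.7+).
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=1) as zf:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, arcname=os.path.relpath(full, root))
                count += 1
    return count

def total_bytes(archive_path):
    """Read every member through one open archive; return the total byte count."""
    with zipfile.ZipFile(archive_path) as zf:
        return sum(len(zf.read(info)) for info in zf.infolist())
```

The point of the second function is that only one file handle is opened, however many members the archive holds, so the per-file open cost is paid once when packing rather than on every read.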
Answer 4:
You might try doing one pass to enumerate the files to a data structure and then open and close them in a second pass, to see whether interleaving the operations is causing contention.
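Something along these lines would separate the two phases and time each one. It is a sketch in Python rather than the Windows API, and the function names are made up for illustration; the structure (collect all paths first, then open them in a separate loop) is the part that matters.

```python
import os
import time

def enumerate_files(root):
    """First pass: collect the full path of every file under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return paths

def timed_two_pass(root):
    """Return (file_count, enumerate_seconds, open_seconds)."""
    start = time.perf_counter()
    paths = enumerate_files(root)
    after_enumerate = time.perf_counter()
    # Second pass: open and immediately close each file, with no
    # directory enumeration happening in between.
    for path in paths:
        with open(path, "rb"):
            pass
    after_open = time.perf_counter()
    return len(paths), after_enumerate - start, after_open - after_enumerate
```

If the open phase is still slow when run on its own like this, interleaving is probably not the culprit and the cost lies in the opens themselves.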
As I posted in the comments, there are lots of performance concerns about having huge numbers of entries in a single NTFS directory. So if you have control over how those files are distributed across directories, you might want to take advantage of that.
Also check for anti-malware on your system. Some will slow down every file access by scanning the entire file each time you try to access it. Using Sysinternals Procmon can help you spot this kind of problem.
When trying to improve performance, it's a good idea to set a goal. How fast is fast enough?
EDIT: This part of the original answer doesn't apply unless you're using Windows XP or earlier:
Opening and closing each file will, by default, update the last-access time in the index. You could try an experiment where you turn that feature off via registry or command line and see how big of a difference it makes. I'm not sure if it's a feasible thing to do in your actual product, since it's a global setting.
Answer 5:
NTFS is slow with a large number of files, especially if they are in the same directory. When they are divided among separate directories and subdirectories, access is faster. I have experience with many files stored by a video camera board (4 cameras), and it was too slow even to see the number of files and their total size (Properties on the root folder). Interestingly, when the disk is FAT32, the same operation is much faster. All sources say that NTFS is faster. Maybe it is faster for reading a single file, but directory operations are slower.
Why do you need so many files? I hope the directory indexing service is enabled.
Source: https://stackoverflow.com/questions/27845026/opening-many-small-files-on-ntfs-is-way-too-slow