Opening many small files on NTFS is way too slow

Backend · unresolved · 5 answers · 1282 views
我寻月下人不归 2021-02-14 14:53

I am writing a program that should process many small files, say thousands or even millions. I've been testing that part on 500k files, and the first step was just to iterate a…

5 Answers
  •  [愿得一人]
    2021-02-14 15:27

    An overhead of 5 to 20ms per file isn't abnormal for an NTFS volume with that number of files. (On a conventional spindled drive, you can't expect much better than that anyway, because it's on the same order as the head seek times. From this point on, I'll assume we're dealing with enterprise-class hardware, SSD and/or RAID.)

    In my experience, you can significantly increase throughput by parallelizing the requests, i.e., using multiple threads and/or processes. Most of the overhead appears to be per-thread: the system can open ten files at once nearly as quickly as it can open a single file by itself. I'm not sure why this is. You might need to experiment to find the optimum level of parallelization.
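
    As an illustration, here is a minimal sketch (my addition, not the answerer's code) of that kind of parallelization using the Win32 API and standard C++ threads; the thread count and the per-file work are placeholders you would tune and fill in:

        // Open (and immediately close) a list of files using a pool of worker
        // threads that pull paths from a shared atomic index.
        #include <windows.h>
        #include <atomic>
        #include <string>
        #include <thread>
        #include <vector>

        void open_files_parallel(const std::vector<std::wstring>& paths, unsigned threads)
        {
            std::atomic<size_t> next{0};           // next unclaimed index into the path list
            std::vector<std::thread> pool;

            for (unsigned t = 0; t < threads; ++t) {
                pool.emplace_back([&] {
                    for (size_t i = next.fetch_add(1); i < paths.size(); i = next.fetch_add(1)) {
                        HANDLE h = CreateFileW(paths[i].c_str(), GENERIC_READ,
                                               FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                               FILE_ATTRIBUTE_NORMAL, nullptr);
                        if (h != INVALID_HANDLE_VALUE) {
                            // ... per-file processing would go here ...
                            CloseHandle(h);
                        }
                    }
                });
            }
            for (auto& th : pool) th.join();
        }

    Benchmarking with 1, 2, 4, 8, ... threads should show where the gains flatten out on your hardware.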

    The system administrator can also significantly improve performance by copying the contents to a new volume, preferably in approximately the same order that they will be accessed. I had to do this recently, and it reduced backup time (for a volume with about 14 million files) from 85 hours to 18 hours.

    You might also try OpenFileById(), which may perform better for files in large directories, since it bypasses the need to resolve the path through the directory tree. However, I've never tried it myself, and it might not have much impact, since the directory is likely to be cached anyway if you've just enumerated it.
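
    For reference, a hedged sketch (not part of the original answer) of what an OpenFileById() call looks like; the volume path and the 64-bit file ID are placeholders you would obtain elsewhere, e.g. from GetFileInformationByHandle() or a USN/MFT enumeration:

        // Reopen a file by its NTFS file ID instead of by path.
        #include <windows.h>
        #include <cstdio>

        int main()
        {
            // Any open handle on the target volume serves as the "volume hint";
            // here the volume root directory itself is opened (C: is assumed).
            HANDLE volume = CreateFileW(L"C:\\", GENERIC_READ,
                                        FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                                        OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
            if (volume == INVALID_HANDLE_VALUE) return 1;

            LARGE_INTEGER fileId;
            fileId.QuadPart = 0x12345;             // placeholder file ID

            FILE_ID_DESCRIPTOR desc = {};
            desc.dwSize = sizeof(desc);
            desc.Type   = FileIdType;
            desc.FileId = fileId;

            HANDLE h = OpenFileById(volume, &desc, GENERIC_READ, FILE_SHARE_READ,
                                    nullptr, 0);
            if (h != INVALID_HANDLE_VALUE) {
                printf("opened file by ID\n");
                CloseHandle(h);
            }
            CloseHandle(volume);
            return 0;
        }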

    You can also enumerate the files on the disk more quickly by reading them from the MFT, although it sounds as if that isn't a bottleneck for you at the moment.
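
    If you do want to try that route, the usual approach is FSCTL_ENUM_USN_DATA on a volume handle, which walks the MFT records directly. A rough sketch (my addition, not the answerer's code) is below; it assumes the C: volume and requires administrator rights:

        // Enumerate every file record on an NTFS volume straight from the MFT.
        #include <windows.h>
        #include <winioctl.h>
        #include <cstdio>
        #include <vector>

        int main()
        {
            HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                     FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                                     OPEN_EXISTING, 0, nullptr);
            if (vol == INVALID_HANDLE_VALUE) return 1;

            MFT_ENUM_DATA_V0 med = {};             // start at record 0, accept any USN
            med.HighUsn = MAXLONGLONG;

            std::vector<BYTE> buf(1 << 16);
            DWORD bytes = 0;

            while (DeviceIoControl(vol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                                   buf.data(), (DWORD)buf.size(), &bytes, nullptr)) {
                // The first 8 bytes of the output are the next starting reference
                // number; USN_RECORD structures follow.
                BYTE* p   = buf.data() + sizeof(DWORDLONG);
                BYTE* end = buf.data() + bytes;
                while (p < end) {
                    USN_RECORD* rec = (USN_RECORD*)p;
                    wprintf(L"%016llx %.*s\n", rec->FileReferenceNumber,
                            (int)(rec->FileNameLength / sizeof(WCHAR)),
                            (const WCHAR*)((BYTE*)rec + rec->FileNameOffset));
                    p += rec->RecordLength;
                }
                med.StartFileReferenceNumber = *(DWORDLONG*)buf.data();
            }
            CloseHandle(vol);
            return 0;
        }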
