Multithreaded File Compare Performance

Question


I just stumbled onto this SO question and was wondering if there would be any performance improvement if:

  1. The file was compared in blocks no larger than the hard disk sector size (512 B, 2 KB, or 4 KB)
  2. AND the comparison was done multithreaded (or maybe even with the .NET 4 parallel stuff)

I imagine having two threads: one that reads from the beginning of the file and another that reads from the end, until they meet in the middle.

I understand that in this situation the disk I/O is going to be the slowest part, but if the reads never have to cross sector boundaries (which, in my twisted imagination, somehow eliminates any possible fragmentation overhead), it may reduce head movement and hence result in better performance (maybe?).

Of course other factors could come into play as well, such as single vs. multiple processors/cores or SSD vs. non-SSD, but with those aside: is the disk I/O speed plus the overhead of sharing processor time insurmountable? Or perhaps my concept of computer theory is completely off base...


Answer 1:


If you're comparing two files that are on the same drive, the only benefit you could receive from multi-threading is to have one thread reading--populating the next buffers--while another thread is comparing the previously-read buffers.

If the files you're comparing are on different physical drives, then you can have two asynchronous reads going concurrently--one on each drive.
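Here is a minimal sketch of that read-ahead pattern in modern C# (the question targets .NET 4, where the same overlap would be expressed with FileStream.BeginRead/EndRead, but the structure is identical). Two buffers per file: while one pair is being compared, the next pair is already being filled, and the two files' reads are issued concurrently, which is where the second-drive benefit comes from. The file paths and the 256 KB buffer size are illustrative assumptions, not anything from the original answer:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class OverlappedCompare
{
    // ReadAsync may return fewer bytes than requested, so loop until the
    // buffer is full or the stream ends.
    static async Task<int> ReadFullyAsync(Stream s, byte[] buf)
    {
        int total = 0;
        while (total < buf.Length)
        {
            int n = await s.ReadAsync(buf, total, buf.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    static async Task<bool> FilesEqualAsync(string pathA, string pathB)
    {
        const int BufferSize = 256 * 1024; // illustrative; tune per machine

        using var a = new FileStream(pathA, FileMode.Open, FileAccess.Read,
                                     FileShare.Read, BufferSize, useAsync: true);
        using var b = new FileStream(pathB, FileMode.Open, FileAccess.Read,
                                     FileShare.Read, BufferSize, useAsync: true);
        if (a.Length != b.Length) return false;

        // Two buffers per file: compare one pair while the next pair fills.
        var bufA = new[] { new byte[BufferSize], new byte[BufferSize] };
        var bufB = new[] { new byte[BufferSize], new byte[BufferSize] };

        int cur = 0;
        // The two files are read concurrently; on separate physical drives
        // these reads overlap in hardware, on one drive the OS serializes them.
        var pendingA = ReadFullyAsync(a, bufA[cur]);
        var pendingB = ReadFullyAsync(b, bufB[cur]);

        while (true)
        {
            int nA = await pendingA;
            int nB = await pendingB;
            if (nA != nB) return false; // guard; can't happen after the length check
            if (nA == 0) return true;   // both streams exhausted: files are equal

            // Start the next reads *before* comparing, so the comparison
            // runs while the I/O is in flight.
            int next = 1 - cur;
            pendingA = ReadFullyAsync(a, bufA[next]);
            pendingB = ReadFullyAsync(b, bufB[next]);

            if (!bufA[cur].AsSpan(0, nA).SequenceEqual(bufB[cur].AsSpan(0, nA)))
            {
                await Task.WhenAll(pendingA, pendingB); // drain in-flight reads before disposing
                return false;
            }
            cur = next;
        }
    }
}
```

Note that the early-exit on a mismatch still waits for the in-flight reads before the streams are disposed; on a single spindle the compare work is simply hidden behind the I/O, which is exactly the benefit described above.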

But your idea of having one thread reading from the beginning and another reading from the end will make things slower because seek time is going to kill you. The disk drive heads will continually be seeking from one end of the file to the other. Think of it this way: do you think it would be faster to read a file sequentially from the start, or would it be faster to read 64K from the front, then read 64K from the end, then seek back to the start of the file to read the next 64K, etc?

Fragmentation is an issue, to be sure, but excessive fragmentation is the exception, not the rule. Most files are going to be unfragmented, or only partially fragmented. Reading alternately from either end of the file would be like reading a file that's pathologically fragmented.

Remember, a typical disk drive can only satisfy one I/O request at a time.

Making single-sector reads will probably slow things down. In my tests of .NET I/O speed, reading 32K at a time was significantly faster (between 10 and 20 percent) than reading 4K at a time. As I recall (it's been some time since I did this), on my machine at the time, the optimum buffer size for sequential reads was 256K. That will undoubtedly differ for each machine, based on processor speed, disk controller, hard drive, and operating system version.
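A measurement like that is easy to reproduce. Below is a rough sketch of one way to do it; the test-file path is a hypothetical placeholder, and on repeated runs the OS file cache will serve the data from memory, so use a file larger than RAM or flush the cache between runs:

```csharp
using System;
using System.Diagnostics;
using System.IO;

class BufferSizeBench
{
    static void Main()
    {
        // Hypothetical test file; use something large enough that the timing
        // is dominated by disk I/O rather than startup overhead.
        string path = @"C:\temp\testfile.bin";

        foreach (int size in new[] { 4 * 1024, 32 * 1024, 256 * 1024, 1024 * 1024 })
        {
            var buffer = new byte[size];
            var sw = Stopwatch.StartNew();
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, size, FileOptions.SequentialScan))
            {
                // Read the whole file sequentially; we only care about timing.
                while (fs.Read(buffer, 0, size) > 0) { }
            }
            sw.Stop();
            Console.WriteLine($"{size / 1024,5} KB buffer: {sw.ElapsedMilliseconds} ms");
        }
    }
}
```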



Source: https://stackoverflow.com/questions/8470306/multithreaded-file-compare-performance
