One of our company's servers has 32 CPUs, and we have 1000+ very large files to be processed. I'm not sure if it is a good idea to read 32 files at the same time so that all cores are kept busy.
The hard disk is traditionally a mechanical data storage device. I'm assuming the server uses mechanical ones rather than the newer SSD type of drive, which has no moving parts. I'm also assuming, with this much data and processing power, that more than one hard disk is being used (RAID or NAS). These details can affect performance significantly and could render much of the following inaccurate.
Hard disks, being mechanical devices, have spinning discs inside, like an old-fashioned record player or CD. Each disc is coated with a material that can record and play back tiny magnetic pulses. A positionable "read-write" head flies just above each surface, usually on both sides of every disc, ready to move across it to locate, read, and write those magnetic pulses. Both the spinning and the head movement take time. The more "work" a disk is given to do, the longer it takes to finish, simply because it has to physically locate more microscopic areas on the surfaces of the discs.
That said, imagine there are 29 employees assigned to read all 29 volumes of the Encyclopedia Britannica. (3 supervisors, of course.) Each volume is stored on one hard disk, so there are 29 hard disks. There are two ways in which the whole thing can be read:

Option #1: The whole team works through a single volume at a time, reading it from cover to cover before moving on to the next one. Only one disk is in use at any moment, and it is read straight through.

Option #2: Everything is read at once, with every employee busy and all 29 disks being accessed at the same time.
Option #1 seems "antiquated", however one important thing about this method is that the other 28 disks are not being used at all. Only one is. Hard disks are far better at reading data sequentially than randomly. This is because sequential reading avoids the delays caused by the read-write heads seeking back and forth.
Option #2 would work, and sounds reasonable, but it isn't ideal for two reasons: a) there is almost no sequential reading, and b) all of the disks are in use at once. This uses more power and places a bigger demand on the server, which has to drive all of those disks concurrently.
So yes, if you try to process 32 huge files simultaneously, then that is going to place a tremendous load on the disks, and they will probably slow to a crawl. It is more complicated, but likely a better solution, to have the 32 cores "take turns" with one of those huge files at a time until they are all processed. (By "take turns" I mean break it up into smaller, more manageable chunks.) Again, the goal is to make the disks read as sequentially as possible, and avoid random seeking-back-and-forth.
Software to accomplish this must be multi-threaded, meaning that just one program is started by the user, but it creates 31 new "worker threads" for the other CPU cores. The main program reads the data sequentially and splits the incoming stream into chunks for the other threads (cores) to process. Those threads then "take turns" crunching small pieces of the whole data file until it is completely processed.
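As a concrete sketch of that structure (assuming Python, and with `process_chunk` as a hypothetical stand-in for the real per-chunk work), the program below reads one large file front to back in big chunks and hands each chunk to a pool of workers, throttling so the reader never gets far ahead of them. It uses worker processes rather than threads because, in CPython, pure-Python threads cannot occupy more than one core at a time; the read-sequentially-and-dispatch pattern is the same either way.

```python
import concurrent.futures as cf
import os

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MiB per chunk (an assumption; tune to taste)
MAX_IN_FLIGHT = os.cpu_count() or 4  # keep roughly one chunk queued per core

def process_chunk(chunk: bytes) -> int:
    # Hypothetical per-chunk work; counting newlines stands in for the
    # real processing each core would do.
    return chunk.count(b"\n")

def read_chunks(path):
    # Read the file front to back in large pieces, so the disk sees one
    # long sequential read instead of scattered seeks.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk

def process_file(path):
    total = 0
    with cf.ProcessPoolExecutor() as pool:   # one worker per core by default
        in_flight = set()
        for chunk in read_chunks(path):
            in_flight.add(pool.submit(process_chunk, chunk))
            # Throttle: don't let the reader get far ahead of the workers,
            # or unprocessed chunks pile up in memory.
            if len(in_flight) >= MAX_IN_FLIGHT:
                done, in_flight = cf.wait(in_flight, return_when=cf.FIRST_COMPLETED)
                total += sum(f.result() for f in done)
        done, _ = cf.wait(in_flight)
        total += sum(f.result() for f in done)
    return total

if __name__ == "__main__":
    # Process the files one at a time; all cores "take turns" on each file.
    for path in ["huge1.dat", "huge2.dat"]:   # hypothetical file names
        print(path, process_file(path))
```

The outer loop then handles the 1000+ files one after another, so at any given moment the disks see one long sequential read instead of 32 interleaved ones.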