I want to read a file as fast as possible (40k lines) [Edit : the rest is obsolete].
Edit: Andres Jaan Tack suggested a solution based on one thread per file, and I
[Edit: original question asked if launching up to 40,000 threads would speed up file read]
What you suggest would most likely slow down the access due to the overhead of creating threads and context switching. More threads only help if you are
1) computationally bound and you have extra cores that could help with the work
2) blocking and other threads could work while waiting for others to unblock
3) you have a very clever algorithm that makes use of cache behavior
Most likely your speed is bound by disk and/or memory bandwidth not computational limits so a single execution thread would be able to max those out.
Yes, it's a waste of time. At very best you'll end up with about the same performance. At worst, it might hurt performance from the disk seeking to different parts of the file instead of reading through it consecutively.
In contrast to other readers I believe that theoretically there can be some benifit, even if you're running on an SP (single-processor) system. However I'd never do this for as much as 40K lines (assuming you talk about normal-sized lines).
They key is Amardeep's answer, where he/she says that creating threads is useful when a thread becomes blocked for some reason.
Now, how do mapped files "work"? When you access a memory page in that region for the first time - the processor generates a page fault. The OS loads the contents of the file (this involves disk access) into the memory page. Then the execution returns to your thread.
I also believe upon page fault the OS fills a bunch of consecutive pages, not just single one.
Now, what's important is that during the page fault processing your thread is suspended. Also during this period the CPU isn't loaded (apart from what other processes may do).
So that if you look at the time scale you see a period of two sections: one where CPU is loaded (here you read the contents of the page and do some processing), and one where CPU is nearly idle and the I/O on the disk is performed.
On the other hand you may create several threads, each one is assigned to read a different portion of the file. You benefit from two effects:
Other thread has a chance to load the CPU (or several CPUs if MP system) when one is blocked by I/O.
Even in case where the processing is very short (hence the CPU is not the bottleneck) - still there's a benefit. It's related to the fact that if you issue several I/O on the same physical device - it has a chance to perform them more efficiently.
For instance, when reading many different sectors from the HD drive you can actually read them all within one disk rotation.
P.S.
And, of course, I'd never thought to do this for 40K lines. The overhead of creating threads, waiting for them to finish, context switches, logic complification, error/failure handling, and etc.
I'd try to do this for a file of at least tens of MBs.
Your question got a little bit deeper, when you asked further. I'll try to cover all your options...
Use one thread.
If you read straight through a file front-to-back from a single thread, the operating system will not fetch the file in small chunks like you're thinking. Rather, it will prefetch the file ahead of you in huge (exponentially growing) chunks, so you almost never pay a penalty for going to disk. You might wait for the disk a handful of times, but in general it will be like the file was already in memory, and this is even irrespective of mmap
.
The OS is very good at this kind of sequential file reading, because it's predictable. When you read a file from multiple threads, you're essentially reading randomly, which is (obviously) less predictable. Prefetchers tend to be much less effective with random reads, in this case probably making the whole application slower instead of faster.
Notice: This is even before you add the cost of setting up the threads and all the rest of it. That costs something, too, but it's basically nothing compared with the cost of more blocking disk accesses.
Use as many threads as you have files (or some reasonable number).
File prefetching done separately for each open file. Once you start reading multiple files, you should read from several of them in parallel. This works because the disk I/O Scheduler will try to figure out the fastest order in which to read all of them in. Often, there's a disk scheduler both in the OS and on the hard drive itself. Meanwhile, the prefetcher can still do its job.
Reading several files in parallel is always better than reading the files one-by-one. If you did read them one at a time, your disk would idle between prefetches; that's valuable time to read more data into memory! The only way you can go wrong is if you have too little RAM to support many open files; that's not common, anymore.
A word of caution: If you're too overzealous with your multiple file reads, reading one file will start kicking bits of other files out of memory, and you're back to a random-read situation.
Processing and producing output from multiple threads might work, but it depends how you need to combine them. You'll have to be careful about how you synchronize the threads, in any case, though there are surely some relatively easy lock-less ways to do that.
One thing to look for, though: Don't bother writing the file in small (< 4K) blocks. Collect at least 4K of data at a time before you call write()
. Also, since the kernel will lock the file when you write it, don't call write()
from all of your threads together; they'll all wait for each other instead of processing more data.
I'm thinking like this.
You have 8 cores, so make 8 threads. Let each thread parse one block of the file. So you need to get the device/disk block size. When a block has been parsed by a thread let the thread parse a new one not yet "assigned" to a thread.
Another idea I have would to have 2 threads. A parsing thread and a thread just stepping over the file's disk blocks, ie just by reading the first byte of each block thus forcing the file to be read into memory as fast as possible.
But, this could be made into a contest. Nothing beats doing real live runs! and people will show you! :) find a suitable price!
This is a problem of granularity. You've got a small file, and very little processing to do. One thread can probably gobble the entire file in one time slice and process it in the next. Two threads would be worse than one. You need a much larger task before considering parallelism as a performance solution.