I have to write a not-so-large program in C++, using boost::thread.
The problem at hand is to process a large number of files (maybe thousands or tens of thousands, possibly more).
The answer depends somewhat on how CPU intensive the processing you need to perform on each file is.
At one extreme where the processing time dominates the I/O time, the benefit that threading gives you is just the ability to take advantage of multiple cores (and possibly hyperthreading) to make use of the maximum available processing power of your CPU. In this case you'd want to aim for a number of worker threads roughly equal to the number of logical cores on the system.
At the other extreme where I/O is your bottleneck, you aren't going to see all that much benefit from multiple threads, since they will spend most of their time sleeping, waiting for I/O to complete. In that case you'd want to focus on maximizing your I/O throughput rather than your CPU utilization. On a single unfragmented hard drive or a DVD where you were I/O bound, having multiple threads would likely hurt performance, since you'd get maximum I/O throughput from sequential reads on a single thread. If the drive is fragmented, or you have a RAID array or similar, then having multiple I/O requests in flight simultaneously might boost your I/O throughput, since the controller may be able to intelligently rearrange them to make more efficient reads.
I think it might be helpful to view this as really two separate problems. One is how to get maximum I/O throughput for your file reads, the other is how to make maximum use of your CPU for processing the files. You would probably get optimal throughput by having a small number of I/O threads kicking off I/O requests and a pool of worker threads roughly equal to the number of logical CPU cores processing the data as it becomes available. Whether it is worth the effort to implement a more complex setup like that depends on where the bottlenecks are in your particular problem though.
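As a rough illustration of that split, here is a minimal C++ sketch of sizing the two pools with Boost (the fallback values are arbitrary assumptions, and boost::thread::hardware_concurrency() can return 0 when the core count isn't known):

    #include <boost/thread/thread.hpp>
    #include <iostream>

    int main()
    {
        // Workers: roughly one per logical core, as suggested above.
        unsigned workers = boost::thread::hardware_concurrency();
        if (workers == 0)
            workers = 2;                // arbitrary fallback when the count is unknown

        // I/O threads: keep this small; one or two is usually enough to keep
        // a single disk busy with read requests.
        unsigned io_threads = 2;        // placeholder, tune for your hardware

        std::cout << "workers: " << workers
                  << ", I/O threads: " << io_threads << "\n";
    }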
There are a lot of variables that will affect performance (OS, filesystem, hard drive speed vs CPU speed, data access patterns, how much processing is done on the data after it is read, etc.).
So your best bet is to simply try a test run for every possible thread count, on a representative data set (a big one if possible, so that filesystem caching won't skew the results too badly), and record how long it takes each time. Start with a single thread, then try it again with two threads, and so on until you feel you have enough data points. At the end you should have data that graphs into a nice curve that indicates where the "sweet spot" is. You should be able to do this in a loop so that the results are compiled automatically overnight.
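A crude harness for that experiment might look like the following sketch; process_all() here is just a placeholder for your real driver (it is not a library call), and the upper bound of 16 threads is arbitrary:

    #include <chrono>
    #include <iostream>
    #include <thread>

    // Placeholder: replace with the code that runs the whole job using
    // `thread_count` worker threads and returns when it is done.
    void process_all(unsigned thread_count)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(200 / thread_count));
    }

    int main()
    {
        for (unsigned n = 1; n <= 16; ++n)   // arbitrary upper bound
        {
            auto start = std::chrono::steady_clock::now();
            process_all(n);
            auto stop  = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
            std::cout << n << " thread(s): " << ms << " ms\n";
        }
    }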
More threads will not necessarily give you higher throughput. Threads have a non-trivial cost, both to create (in terms of CPU time and OS resources) and to run (in terms of memory and scheduling). And the more threads you have, the more potential for contention with other threads. Adding threads can sometimes even slow down execution. Each problem is subtly different, and you are best off writing a nice, flexible solution and experimenting with the parameters to see what works best.
Your example code, spawning a thread for each file, would almost immediately swamp the system for values of max_threads beyond around 10. As others have suggested, a thread pool with a work queue is what you probably want. The fact that each file is independent is nice, as that makes it almost embarrassingly parallel (save for the aggregation at the end of each unit of work).
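A minimal sketch of that pool-plus-queue idea, using boost::thread as in the question: process_file() below is only a stand-in that counts bytes so the example stays self-contained, and the file names in main() are placeholders. Each worker accumulates a local total and merges it once at the end, matching the aggregation step mentioned above.

    #include <boost/thread/thread.hpp>
    #include <boost/thread/mutex.hpp>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    static std::vector<std::string> files;   // work queue: one entry per file
    static std::size_t  next_index = 0;      // guarded by queue_mutex
    static long long    grand_total = 0;     // guarded by total_mutex
    static boost::mutex queue_mutex;
    static boost::mutex total_mutex;

    // Stand-in for the real per-file work: just count bytes.
    long long process_file(const std::string& path)
    {
        std::ifstream in(path.c_str(), std::ios::binary);
        long long bytes = 0;
        char buf[64 * 1024];
        while (in.read(buf, sizeof buf) || in.gcount() > 0)
            bytes += in.gcount();
        return bytes;
    }

    void worker()
    {
        long long local_total = 0;           // per-worker aggregation
        for (;;)
        {
            std::size_t i;
            {
                boost::mutex::scoped_lock lock(queue_mutex);
                if (next_index >= files.size())
                    break;                   // queue drained
                i = next_index++;
            }
            local_total += process_file(files[i]);
        }
        boost::mutex::scoped_lock lock(total_mutex);
        grand_total += local_total;          // merge once per worker
    }

    int main()
    {
        files.push_back("example1.txt");     // placeholders: fill with real names
        files.push_back("example2.txt");

        unsigned n = boost::thread::hardware_concurrency();
        if (n == 0) n = 2;

        boost::thread_group pool;
        for (unsigned t = 0; t < n; ++t)
            pool.create_thread(&worker);
        pool.join_all();

        std::cout << "total bytes: " << grand_total << "\n";
    }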
Quite a few factors will affect your throughput.
Last year I wrote an application that does essentially the same as you describe. I ended up using Python and the pprocess library. It used a multi-process model with a pool of worker processes, communicating via pipes (rather than threads). A master process would read the work queue, chop up the input into chunks, and send chunk info to the workers. A worker would crunch the data, collecting stats, and when it was done send the results back to the master. The master would combine the results with the global totals and send another chunk to the worker. I found it scaled almost linearly up to 8 worker processes (on an 8-core box, which is pretty good), and beyond that it degraded.
Some things to consider:

- mmap() (or equivalent) to memory map the input files, but only after you've profiled the baseline case (a quick sketch follows below).
- When you have a significant number of files in the one directory as you describe, aside from potentially hitting filesystem limits, the time to stat the directory and figure out which files you've already processed and which you still need to process goes up significantly. Consider breaking up the files into subdirectories by date, for example.
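For the mmap() point, here is a bare-bones POSIX sketch (Linux/Unix only, minimal error handling) that maps one file read-only and walks it as a flat byte array; the line count is just a placeholder for whatever processing you actually do:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <iostream>

    int main(int argc, char** argv)
    {
        if (argc < 2) { std::cerr << "usage: " << argv[0] << " <file>\n"; return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }
        if (st.st_size == 0) { std::cout << "0 lines\n"; close(fd); return 0; }

        void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        const char* data = static_cast<const char*>(p);
        long long newlines = 0;
        for (off_t i = 0; i < st.st_size; ++i)   // example "processing": count lines
            if (data[i] == '\n')
                ++newlines;
        std::cout << newlines << " lines\n";

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }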
One more word on performance profiling: be careful when extrapolating performance from small test data sets to super-huge data sets. You can't. I found out the hard way that you can reach a certain point where regular assumptions about resources that we make every day in programming just don't hold any more. For example, I only found out the statement buffer in MySQL is 16MB when my app went way over it! Keeping 8 cores busy can take a lot of memory; you can easily chew up 2GB of RAM if you're not careful! At some point you have to test on real data on the production system, but give yourself a safe test sandbox to run in, so you don't munge production data or files.
Directly related to this discussion is a series of articles on Tim Bray's blog called the "Wide Finder" project. The problem was simply to parse logfiles and generate some simple statistics, but in the fastest manner possible on a multicore system. Many people contributed solutions, in a variety of languages. It is definitely worth reading.
If the workload is anywhere near as I/O bound as it sounds, then you're probably going to get maximum throughput with about as many threads as you have spindles. Even if you have more than one disk, if all the data sits on a single RAID 0 array (which the OS sees as one device), you probably don't want more than one thread. If more than one thread is trying to access non-sequential parts of the disk, the OS must stop reading one file, even though it may be right under the head, and move to another part of the disk to service another thread, so that it doesn't starve. With only one thread, the disk never needs to stop reading to move the head.
Obviously that depends on the access patterns being very linear (such as with video recoding) and the data actually being unfragmented on disk, both of which matter a lot. If the workload is more CPU bound, then it won't matter quite as much and you can use more threads, since the disk will be twiddling its thumbs anyway.
As other posters suggest, profile first!
According to Amdahl's Law, which Herb Sutter discussed in his article:
Some amount of a program's processing is fully "O(N)" parallelizable (call this portion p), and only that portion can scale directly on machines having more and more processor cores. The rest of the program's work is "O(1)" sequential (s). [1,2] Assuming perfect use of all available cores and no parallelization overhead, Amdahl's Law says that the best possible speedup of that program workload on a machine with N cores is given by

    speedup = (s + p) / (s + p/N)
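To put hypothetical numbers on that: if p = 0.9 of the work parallelizes and s = 0.1 is inherently sequential, then on N = 8 cores the bound is (0.1 + 0.9) / (0.1 + 0.9/8) ≈ 4.7x, nowhere near the naive 8x.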
In your case, I/O operations (and possibly synchronization) could take most of the time. You could measure how much time is spent blocked in slow I/O operations and use that to estimate, approximately, the number of threads that will be suitable for your task.
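One common rule of thumb for turning that measurement into a number (not from Sutter's article) is threads ≈ cores × (1 + wait time / compute time). For example, if each file spends roughly 80 ms blocked on I/O for every 20 ms of CPU work, a 4-core machine could in principle be kept busy by about 4 × (1 + 80/20) = 20 threads; you would still want to confirm that figure by measurement, as suggested elsewhere on this page.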
A full list of Herb Sutter's concurrency-related articles can be found here.
Not to sound trite, but you use as many threads as you need.
Basically you can draw a graph of the number of threads against the (real) time to completion. You can also draw one of the number of threads against the total time spent across all threads.
The first graph in particular will help you identify where the bottleneck in CPU power lies. At some point you will become either I/O bound (meaning the disk can't load the data fast enough) or the number of threads will become so large that it impacts the performance of the machine.
The second does happen. I saw one piece of code that ended up creating 30,000+ threads; it ended up being quicker once capped at 1,000.
The other way to look at this is: how fast is fast enough? The point where I/O becomes a bottleneck is one thing, but you may hit a point before that where it's "fast enough".