I have to write a not-so-large program in C++ using boost::thread.
The problem at hand is to process a large number of small files (maybe thousands or tens of thousands; hundreds of millions in a worst-case scenario).
This might sound a bit too old school, but have you considered simply forking processes? It sounds like you have highly independent work units with only a small aggregation of return data. A process model would also free up virtual address space (which might be tight if you're on a 32-bit machine), giving each worker room to, say, mmap() the whole file being processed.