I have to write a not-so-large program in C++, using boost::thread.
The problem at hand is to process a large number of files (maybe thousands or tens of thousands; possibly hundreds of thousands or more).
This might sound a bit old school, but have you considered simply forking processes? It sounds like you have highly independent work units with only a small aggregation of return data. A process model would also free up virtual address space (which might be tight if you're on a 32-bit machine), giving each worker room to, say, mmap() the whole file being processed.
I agree with everyone suggesting a thread pool: You schedule tasks with the pool, and the pool assigns threads to do the tasks.
If you're CPU-bound, simply keep adding threads as long as CPU usage stays below 100%. Once you're I/O-bound, disk thrashing may at some point stop additional threads from improving throughput; where exactly that point lies, you'll have to measure yourself.
Have you seen Intel's Threading Building Blocks? Note that I can't comment on whether it's what you need; I only built a small toy project with it on Windows, and that was a few years ago. (It was somewhat similar to yours, BTW: it recursively traverses a folder hierarchy and counts lines in the source code files it finds.)
As a ballpark, keep the thread count close to the number of hardware threads (see boost::thread::hardware_concurrency()) for CPU-bound work, and only somewhat above it for I/O-bound work; running far more threads than that mostly adds lock contention and context-switching overhead without adding throughput.