I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur).
Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For simple multi-threaded example, check gil_threaded by Victor Bogado. The An image processing network using Dataflow.Signals and Boost.GIL explains an interestnig dataflow model too.
You also could use libraries like IPP or the Cassandra Vision C++ API that are mostly much more optimized than you own code.
Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targetting on a desktop machine. In which case does the platform have multiple cores or hyperthreaded? If not then adding threads to your application could have the opposite effect and slow it down...
To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.
Maybe write your own tiny library which implements a few standard threading functions using #ifdef
's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.
I think regardless of the threading model you choose (boost, pthread, native threads, etc). I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned, it's just a matter of giving it something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.