To implement "work stealing" isn't hard in theory. You need a set of queues containing tasks that do work by doing a combination of computing and generating other tasks to do more work. And you need atomic access to the queues to place newly generated tasks into those queues. Finally, you need a procedure that each task calls at the end, to find more work for the thread that executed the task; that procedure needs to look in work queues to find work.
Most such work-stealing systems make the assumption that there are a small number of threads (backed up typically by real processor cores), and that there is a exactly one work queue per thread. Then you first try to steal work from your own queue, and if it is empty, try to steal from others. What gets tricky is knowing which queues to look in; scanning them serially for work is pretty expensive and can create a huge amount of contention between threads looking for work.
So far this is all pretty generic stuff with one two major exceptions: 1) switching contexts (e.g, setting processor context registers such as a "stack") cannot be stated in pure C or C++. You can resolve this by agreeing to write part of your package in target-platform specific machine code. 2) Atomic access to the queues for a multiprocessor cannot be done purely in C or C++ (ignoring Dekker's algorithm), and so you'll need to code those using assembly language synchronization primitives like the X86 LOCK XCH or Compare and Swap. Now, the code involved in updating the queuse once you have safe access isn't very complex, and you could easily write that in a few lines of C.
However, I think you will find is that attempting to code such a package in C and C++ with mixed assembler is still rather inefficient and you'll eventually end up coding the entire thing in assembler anyway. All's that left are C/C++ compatible entry points :-}
I did this for our PARLANSE parallel programming language, which offers the idea of an arbitrarily large number of parallel computations live and interacting (synchonizing) at any instant. It is implemented behind the scenes on an X86 exactly with one thread per CPU, and the implementation is entirely in assembler. The work-stealing code is probably 1000 lines total, and its tricky code because you want it to be extremely fast in the non-contention case.
The real fly in the ointment for C and C++ is, when you create a task representing work, how much stack space do you assign? Serial C/C++ programs avoid this question by simply overallocating huge amounts (e.g, 10Mb) of one linear stack, and nobody cares much about how much of that stack space is wasted. But if you can create thousands of tasks and have them all live at a particular instant, you can't reasonably allocate 10Mb to each one. So now you either need to determine statically how much stack space a task will need (Turing-hard), or you'll need to allocate stack chunks (e.g., per function call), which widely available C/C++ compilers don't do (e.g, the one you are likely using). THe last way out is to throttle task creation to limit it to a few hundred at any instant, and multiplex a few-hundred really huge stacks among the tasks that are live. You can't do the last if the tasks can interlock/suspend state, because you'll run into your threshold. So you can only do this if the tasks only do computation. That seems like a pretty severe constraint.
For PARLANSE, we built a compiler that allocates activation records on the heap for each function call.