No free lunch.
Please take a look the original work stealing paper. This paper is hard to understand. I know that paper contains theoretical proof rather than pseudo code. However, there is simply no such much more simple version than TBB. If any, it won't give optimal performance. Work stealing itself incurs some amount of overhead, so optimizations and tricks are quite important. Especially, dequeues are must be thread-safe. Implementing highly scalable and low-overhead synchronizations are challenging.
I'm really wondering why you need it. I think that proper implementation means something like TBB and Cilk. Again, work stealing is hard to implement.