http://pastebin.com/YMS4ehRj
^ This is my implementation of parallel merge sort. Basically, for every split, the first half is handled by a thread where
Given you have a finite number of cores on your system, why would you want to create more threads than cores?
Also, it isn't clear why you need a mutex at all. As far as I can tell from a quick scan, the program doesn't need to share threads[lthreadcnt] outside the local function. Just use a local variable for the thread handle and you should be golden.
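To make both suggestions concrete, here is a rough sketch, not a rewrite of the pastebin code. It assumes C++11 std::thread rather than whatever threading API the original uses, and the names merge_sort_range, spawn_depth and parallel_merge_sort are made up for illustration. The idea is to spawn threads only for the first few levels of recursion, so the number of running threads stays at or below the core count, and to keep each thread handle in a local variable so nothing needs a mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sort v[lo, hi). `depth` is the number of recursion levels that may still
// spawn a thread; once it reaches 0 the sort continues sequentially.
void merge_sort_range(std::vector<int>& v, std::size_t lo, std::size_t hi,
                      unsigned depth) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo < 2) return;  // 0 or 1 element: already sorted
    std::size_t mid = lo + (hi - lo) / 2;

    if (depth > 0) {
        // The thread handle is a local variable. Only this call touches it,
        // so there is no shared threads[] array and no mutex.
        std::thread left(merge_sort_range, std::ref(v), lo, mid, depth - 1);
        merge_sort_range(v, mid, hi, depth - 1);  // right half in this thread
        left.join();  // both halves must be done before merging
    } else {
        merge_sort_range(v, lo, mid, 0);
        merge_sort_range(v, mid, hi, 0);
    }
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Largest depth d such that 2^d <= number of cores (assumed 2 if unknown).
unsigned spawn_depth() {
    unsigned cores = std::thread::hardware_concurrency();  // may return 0
    if (cores == 0) cores = 2;
    unsigned depth = 0;
    while (depth < 16 && (1u << (depth + 1)) <= cores) ++depth;
    return depth;
}

// Hypothetical convenience wrapper: sort the whole vector.
void parallel_merge_sort(std::vector<int>& v) {
    merge_sort_range(v, 0, v.size(), spawn_depth());
}
```

Calling parallel_merge_sort(v) on a std::vector<int> sorts it in place. The two halves never overlap, so the threads write to disjoint parts of the vector, and the join() before the merge is the only synchronization needed. On most toolchains you will have to link with the platform's thread library (for example -pthread with GCC or Clang).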