I have a question that could seem very basic, but it is in a context where "every CPU tick counts" (this is a part of a larger algorithm that will be used on supercomputers).
You might find this to be an interesting read. I would start with the standard library's std::sort and only then try to improve on it. I'm not sure whether you have access to a C++11 compiler (such as GCC 4.7) on this supercomputer, but I would suggest that std::sort combined with std::future and std::thread would get you quite a long way toward parallelizing the problem in a maintainable way.
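To make that suggestion concrete, here is a minimal sketch of one way to combine std::sort with std::future. It is not a tuned implementation: the name parallel_sort, the depth parameter, and the 4096-element cutoff are all made up for illustration, and it uses std::async to start the extra thread rather than managing std::thread objects by hand. Each level sorts the left half on another thread, sorts the right half on the current one, and then merges the two with std::inplace_merge.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Hypothetical helper (not part of the standard library): sorts [first, last)
// by sorting the two halves concurrently and merging them.
// `depth` limits the recursion, so at most about 2^depth ranges are sorted
// at the same time. The 4096 cutoff is an arbitrary assumption; measure on
// the target machine before settling on a value.
template <typename RandomIt>
void parallel_sort(RandomIt first, RandomIt last, int depth = 2) {
    assert(depth >= 0);
    const auto n = last - first;

    // Small range or no depth left: fall back to plain std::sort.
    if (depth == 0 || n < 4096) {
        std::sort(first, last);
        return;
    }

    RandomIt mid = first + n / 2;

    // Sort the left half on a separate thread.
    std::future<void> left = std::async(
        std::launch::async,
        [first, mid, depth] { parallel_sort(first, mid, depth - 1); });

    // Sort the right half on the current thread in the meantime.
    parallel_sort(mid, last, depth - 1);

    // Wait for the left half (get() also rethrows any exception from it).
    left.get();

    // Both halves are sorted; merge them in place.
    std::inplace_merge(first, mid, last);
}
```

Calling it looks the same as std::sort, e.g. `parallel_sort(v.begin(), v.end());` for a `std::vector<int> v`. With GCC this typically needs `-std=c++11 -pthread`. Whether it actually beats a single std::sort depends on the data size and the core count, so it is worth timing both.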
Here is another question that compares std::sort with qsort.
Finally, there is this article in Dr. Dobb's that compares the performance of parallel algorithms.