My preference would be C or C++ because I'm not separated from the machine language by a JIT compiler.
You want to do intense performance tuning, and that means stepping through the hot spots one instruction at a time to see what it is doing, and then tweaking the source code so as to generate optimal assembler.
If you can't get the compiler to generate what you consider good enough assembler code, then by all means write your own assembler for the hot spot(s). You're describing a situation where the need for performance is paramount.
What I would NOT do if I were in your shoes (or ever) is rely on anecdotal generalizations about one language being faster or slower than another. What I WOULD do is multiple passes of intense performance tuning along the lines of THIS and THIS and THIS. I have done this sort of thing numerous times, and the key is to iterate the cycle of diagnosis-and-repair because every slug fixed makes the remaining ones more evident, until you literally can't squeeze another cycle out of that turnip.
Good luck.
Added: Is it the case that there is some seldom-changing configuration information that determines how the bulk of the data is processed? If so, it may be that the program is spending a lot of its time re-interpreting the configuration info to figure out what to do next. If so, it is usually a big win to write a code generator that will read the configuration info and generate an ad-hoc program that can whizz through the data without constantly having to figure out what to do.