I am developing a scientific application used to perform physical simulations. The algorithms used are O(n3), so for a large set of data it takes a very long time to process.
You already suggested it, but disqualified it: Multi cores. You could go for multi core, if you had enough cores. One hot topic atm is GPGPU computing. Esp. NVIDIAs CUDA is a very priomising approach if you have many independent task which have to do the same computation. A GTX 280 delivers you 280 cores, which can compute up to 1120 - 15360 threads simultanously . A pair of them could solve your problem. If its really implementable depends on your algorithm (data flow vs. control flow), because all scalar processors operate in a SIMD fashion.
Drawback: it would be C/C++, not java