Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
Preface: I am trying to do this
SuperWord optimizations on Hotspot are limited and quite fragile. Limited since they are generally behind what a C/C++ compiler offers, and fragile since they depend on particular loop shapes (and are only supported for certain CPUs).
I understand you want to write once run anywhere. It sounds like you already have a pure Java solution. You might want to consider an optional implementation for known popular platforms to supplement that implementation to "fast in some places" which is already true probably.
It's hard to give you more concrete feedback with some code. I suggest you take the loop in question and present it in a JMH benchmark. This makes it easy to analyze and discuss.