Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
Preface: I am trying to do this
Looks like a lot of SIMD/SSE optimizations were made in Java 8/9.
SuperWord optimizations on Hotspot are limited and quite fragile. Limited since they are generally behind what a C/C++ compiler offers, and fragile since they depend on particular loop shapes (and are only supported for certain CPUs).
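To make "particular loop shapes" a bit more concrete, here is a minimal sketch of a loop that is usually considered a good auto-vectorization candidate, next to one that is not. The names (`LoopShapes`, `scaleAndAdd`, `runningSum`) are made up for illustration. Whether HotSpot actually emits SIMD instructions for the first loop is an assumption that depends on the JVM version and the CPU, so check the generated code rather than trusting the shape alone.

```java
final class LoopShapes {

    // Simple counted loop: int induction variable, stride 1, no method calls,
    // no branches in the body, and every iteration independent of the others.
    // Assumption: loops of this shape are typical vectorization candidates,
    // but nothing guarantees it on a particular JVM or CPU.
    static void scaleAndAdd(float gain, float[] in, float[] mix, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = gain * in[i] + mix[i];
        }
    }

    // Loop-carried dependency: iteration i needs the result of iteration i - 1,
    // so the iterations cannot simply be run side by side in SIMD lanes.
    static void runningSum(float[] in, float[] out) {
        float acc = 0f;
        for (int i = 0; i < in.length; i++) {
            acc += in[i];
            out[i] = acc;
        }
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f, 4f};
        float[] mix = {0.5f, 0.5f, 0.5f, 0.5f};
        float[] out = new float[4];
        scaleAndAdd(2f, in, mix, out);
        System.out.println(java.util.Arrays.toString(out));
        runningSum(in, out);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

Both methods compute correct results either way; the difference is only in how much freedom the JIT has to process several elements per instruction.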
I understand you want to write once, run anywhere. It sounds like you already have a pure Java solution. You might want to consider supplementing it with optional implementations for known popular platforms, so that it is "fast in some places", which is probably already true.
It's hard to give you more concrete feedback without some code. I suggest you take the loop in question and present it in a JMH benchmark. This makes it easy to analyze and discuss.
How can I..audio processing..pure java (no JNI to C++, no GPGPU work, etc...)..use vector operations (e.g. SSE2, AVX, etc...)
Java is a high-level language (one instruction in Java generates many hardware instructions) that is, by design (e.g. garbage-collected memory management), not well suited to tasks that manipulate high data volumes in real time.
There is usually special hardware optimized for a particular role (e.g. image processing or speech recognition), which often relies on parallelization through several simplified processing pipelines.
There are also special programming languages for this sort of task, mainly hardware description languages and assembly language.
Even C++ (considered the fast language) will not automagically use some super optimized hardware operations for you. It may just inline one of several hand-crafted assembly language methods at certain places.
So my answer is that there is "probably no way" to instruct the JVM to use some hardware optimization for your code (e.g. SSE), and even if there were, the Java language runtime would still have too many other factors that slow down your code.
Use a low-level language designed for this task and link it to Java for the high-level logic.
EDIT: adding some more info based on comments
If you are convinced that a high-level "write once run anywhere" language runtime definitely should also do lots of low-level optimizations for you and automagically turn your high-level code into optimized low-level code, then... the way the JIT compiler optimizes depends on the implementation of the Java Virtual Machine. There are many of them.
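Since the answer depends on the JVM implementation, it helps to know which one you are actually running. A minimal sketch, using only standard system properties (the class and method names `JvmInfo` and `describe` are made up for illustration):

```java
final class JvmInfo {

    // Builds a one-line description of the running JVM from standard
    // system properties: implementation name, version, vendor, CPU architecture.
    static String describe() {
        return System.getProperty("java.vm.name")
                + " " + System.getProperty("java.vm.version")
                + " (" + System.getProperty("java.vm.vendor") + ")"
                + " on " + System.getProperty("os.arch");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The printed string only tells you which implementation and architecture you are on; it says nothing by itself about which optimizations that JIT applies.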
In the case of the Oracle JVM (HotSpot), you can start looking for your answer by downloading the source code. The text SSE2 appears in the following files:
They're in C++ and assembly language so you will have to learn some low level languages to read them anyway.
I would not hunt that deep even with a +500 bounty. IMHO the question is wrong because it is based on wrong assumptions.