I am modifying RNNLM a neural net to study language model. However given the size of my corpus it\'s running real slow. I tried to optimize the matrix*vector routine (which
TL:DR: in optimized code, loads will fold into memory operands for other operations, which don't have alignment requirements in AVX. Stores won't.
Your sample code doesn't compile by itself, so I can't easily check what instruction _mm256_load_ps
compiles to.
I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps
at all for _mm256_load_ps
, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
The store, on the other hand, does generate a vmov...
instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it'll be just as fast when the address is aligned, and still work when it isn't.
I didn't check your code carefully to see if all the accesses SHOULD be aligned. I assume not, from the way you phrased it to just ask why you weren't also getting faults for unaligned loads. Like I said, probably your code just didn't compile to any vmovaps
load instructions, or else even "aligned" AVX loads don't fault on unaligned addresses.
Are you running AVX (without AVX2 or FMA?) on a Sandy/Ivybridge CPU? I assume that's why your FMA instrinsics are commented out.