I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps
to relax the alignment constraint and
See "§2.4.5.1 Efficient Handling of Alignment Hazards" in Intel® 64 and IA-32 Architectures Optimization Reference Manual:
The cache and memory subsystems handles a significant percentage of instructions in every workload. Different address alignment scenarios will produce varying performance impact for memory and cache operations. For example, 1-cycle throughput of L1 (see Table 2-25) generally applies to naturally-aligned loads from L1 cache. But using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access data from L1 will experience varying amount of delays depending on specific microarchitectures and alignment scenarios.
I couldn't copy the table here, but it basically shows that aligned and unaligned L1 loads both take 1 cycle as long as they stay within a cache line, while a load that splits a cache line boundary takes ~4.5 cycles.
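To get a feel for how often the split case would come up, here is a rough sketch that counts how many 16-byte loads in a loop would straddle a cache line. It assumes 64-byte cache lines, which is typical for current x86 parts but not guaranteed, and the names `CACHE_LINE_BYTES`, `splits_cache_line` and `count_split_loads` are made up for illustration. It only inspects addresses; it does not measure anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check your target CPU before relying on this. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if a load of `size` bytes starting at `p` spans two cache lines. */
static int splits_cache_line(const void *p, size_t size)
{
    uintptr_t offset = (uintptr_t)p % CACHE_LINE_BYTES;
    return offset + size > CACHE_LINE_BYTES;
}

/* Counts how many 16-byte loads a loop over `n` floats, stepping four
   floats at a time from `a`, would issue across a cache line boundary. */
static size_t count_split_loads(const float *a, size_t n)
{
    size_t splits = 0;
    for (size_t i = 0; i + 4 <= n; i += 4)
        splits += (size_t)splits_cache_line(a + i, 16);
    return splits;
}
```

Under that 64-byte assumption, an array that is 16-byte aligned never splits, while an array that is only 4-byte aligned (and not 16-byte aligned) splits on one load out of every four. Whether the ~4.5 cycle penalty on those loads matters in practice depends on the microarchitecture and on how memory-bound the loop is, so this is only a starting point for a real benchmark.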