Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?

礼貌的吻别 2021-02-02 12:51

I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps to relax the alignment constraint and

4 answers
  •  孤独总比滥情好
    2021-02-02 13:24

    See "§2.4.5.1 Efficient Handling of Alignment Hazards" in Intel® 64 and IA-32 Architectures Optimization Reference Manual:

    The cache and memory subsystems handles a significant percentage of instructions in every workload. Different address alignment scenarios will produce varying performance impact for memory and cache operations. For example, 1-cycle throughput of L1 (see Table 2-25) generally applies to naturally-aligned loads from L1 cache. But using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access data from L1 will experience varying amount of delays depending on specific microarchitectures and alignment scenarios.

    I couldn't copy the table here, but it basically shows that both aligned and unaligned L1 loads take 1 cycle as long as they stay within one cache line, while a load that splits a cache line boundary takes ~4.5 cycles.
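To make the cache-line-split case concrete, here is a minimal sketch in C. It assumes 64-byte cache lines (the manual excerpt above does not state the line size), and the helper names `crosses_cache_line`, `is_aligned16`, `load4_checked` and `sum_unaligned` are made up for illustration. Only `_mm_load_ps`, `_mm_loadu_ps`, `_mm_add_ps`, `_mm_setzero_ps` and `_mm_storeu_ps` are real SSE intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Assumption: 64-byte cache lines; adjust for other targets. */
#define CACHE_LINE_BYTES 64u

/* Nonzero if p is 16-byte aligned, i.e. legal for _mm_load_ps. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Nonzero if a 16-byte load starting at p straddles two cache lines.
   This is the slow case from the manual's table. */
static int crosses_cache_line(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) > CACHE_LINE_BYTES - 16u;
}

/* Aligned load with a debug-time check instead of a silent fault. */
static __m128 load4_checked(const float *p)
{
    assert(is_aligned16(p));
    return _mm_load_ps(p);
}

/* Sum n floats (n must be a multiple of 4) with unaligned loads,
   so the caller no longer has to guarantee 16-byte alignment. */
static float sum_unaligned(const float *a, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With 64-byte lines, a 16-byte load at a 16-byte-aligned address can never split a line, whereas an arbitrarily offset load splits one whenever its offset within the line exceeds 48 bytes. So relaxing the alignment constraint should mostly cost you on that fraction of loads. Measure on your own target before relying on this.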
