Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?

礼貌的吻别 2021-02-02 12:51

I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps to relax the alignment constraint and

4 answers
  •  被撕碎了的回忆
    2021-02-02 13:13

    There are two questions here: Are unaligned loads slower than aligned loads given the same aligned addresses? And are loads with unaligned addresses slower than loads with aligned addresses?
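To make the distinction concrete, here is a minimal sketch of the two intrinsics side by side. The function names `sum_aligned` and `sum_unaligned` are made up for illustration and are not from the question's code. The first question above is about calling both functions with the same 16-byte-aligned pointer; the second is about calling `sum_unaligned` with a pointer that is not 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps */

/* Adds the four lanes of an SSE register together. */
static float horizontal_sum(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Aligned-load version: p must be 16-byte aligned and n a multiple of 4.
   Passing a misaligned p to _mm_load_ps typically faults at run time. */
static float sum_aligned(const float *p, size_t n)
{
    assert((uintptr_t)p % 16 == 0);
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    return horizontal_sum(acc);
}

/* Unaligned-load version: p may have any alignment; n a multiple of 4. */
static float sum_unaligned(const float *p, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    return horizontal_sum(acc);
}
```

Both functions return the same result for the same data; only the load instruction the compiler is asked to emit differs, so any speed difference has to be measured on the specific CPU.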

    Older Intel CPUs (“older” in this case means just a few years ago) did have slight performance penalties for using unaligned load instructions with aligned addresses, compared to aligned loads with the same addresses. Newer CPUs tend not to have this issue.

    Both older and newer Intel CPUs have performance penalties for loading from unaligned addresses, notably when cache lines are crossed.
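As a rough illustration of the cache-line case, the sketch below tests whether a 16-byte load starting at a given address would straddle two cache lines. The helper name `load_crosses_cache_line` is hypothetical, and the 64-byte line size is an assumption that holds for many recent x86 CPUs but should be checked for the target processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumption: common on recent x86 CPUs */
#define VECTOR_BYTES     16u  /* size of one SSE register load */

/* Returns nonzero if a 16-byte load starting at p touches two cache lines. */
static int load_crosses_cache_line(const void *p)
{
    assert(p != NULL);
    uintptr_t offset = (uintptr_t)p % CACHE_LINE_BYTES;
    return offset + VECTOR_BYTES > CACHE_LINE_BYTES;
}
```

A 16-byte-aligned address never crosses a line under this assumption, because 64 is a multiple of 16. With arbitrary alignment, any start offset above 48 within a line does.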

    Since the details vary from one processor model to another, you would have to check each model individually.

    Sometimes performance issues can be masked. Simple sequences of instructions used for measurement might not reveal that unaligned-load instructions are keeping the load-store units busier than aligned-load instructions would, so that there would be a performance degradation if certain additional operations were attempted in the former case but not in the latter.
