I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps
to relax the alignment constraint and
There are two questions here: Are unaligned loads slower than aligned loads given the same aligned addresses? And are loads with unaligned addresses slower than loads with aligned addresses?
Older Intel CPUs (“older” in this case means just a few years ago) did have slight performance penalties for using unaligned-load instructions with aligned addresses, compared to aligned-load instructions with the same addresses. Newer CPUs tend not to have this issue.
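To make the first question concrete, here is a minimal sketch of the two variants being compared. The function names `sum_aligned` and `sum_unaligned` are hypothetical, and the sketch assumes an x86 target with SSE and a length that is a multiple of 4. The only difference is the load intrinsic: `_mm_load_ps` requires a 16-byte-aligned address, while `_mm_loadu_ps` accepts any address.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps, ... */

/* Add up the four lanes of an SSE register. */
static float horizontal_sum(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* Sum n floats using aligned loads.
   Assumptions: p is 16-byte aligned and n is a multiple of 4. */
float sum_aligned(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    return horizontal_sum(acc);
}

/* Same loop using unaligned loads.
   Assumption: n is a multiple of 4; p may have any alignment. */
float sum_unaligned(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    return horizontal_sum(acc);
}
```

Calling both functions on the same 16-byte-aligned array isolates the cost of the instruction choice itself. Calling `sum_unaligned` on that array offset by one element exercises the second question, loads from genuinely unaligned addresses.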
Both older and newer Intel CPUs have performance penalties for loading from unaligned addresses, notably when cache lines are crossed.
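As a rough illustration of when a load crosses a cache line, the following sketch tests whether an access of a given size straddles a line boundary. The helper name `crosses_cache_line` is hypothetical, and the 64-byte line size is an assumption that holds for many recent x86 processors but should be checked for the actual target.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; verify this for the target CPU. */
#define CACHE_LINE_BYTES 64u

/* Return 1 if a load of `size` bytes starting at p touches two cache lines,
   0 if it fits entirely within one line. */
int crosses_cache_line(const void *p, size_t size)
{
    size_t offset = (size_t)((uintptr_t)p % CACHE_LINE_BYTES);
    return offset + size > CACHE_LINE_BYTES;
}
```

Under that assumption, a 16-byte load from a 16-byte-aligned address never crosses a line, because the possible offsets within a line are 0, 16, 32, and 48. An unaligned 16-byte load crosses a line whenever its offset within the line is greater than 48.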
Since the details vary from one processor model to another, you would have to check each model individually.
Sometimes performance issues can be masked. A simple sequence of instructions used for measurement might not reveal that unaligned-load instructions keep the load-store units busier than aligned-load instructions would. In that case, a performance degradation would appear only when additional operations are attempted: they would slow down the unaligned-load version but not the aligned-load version.