Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors?
A colleague of mine argues
Unaligned loads/stores should never be used, but the reason is not performance. The reason is that the C language forbids them (both via the alignment rules and the aliasing rules), and they don't work on many systems without extremely slow emulation code - code which may also break the C11 memory model needed for proper behavior of multi-threaded code, unless it's done on a purely byte-by-byte level.
As for x86 and x86_64, for most operations (except some SSE instructions), misaligned load and store are allowed, but that doesn't mean they're as fast as correct accesses. It just means the CPU does the emulation for you, and does it somewhat more efficiently than you could do yourself. As an example, a memcpy
-type loop that's doing misaligned word-size reads and writes will be moderately slower than the same memcpy
doing aligned access, but it will also be faster than writing your own byte-by-byte copy loop.