Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors?
A colleague of mine argues
Unaligned 32 and 64 bit access is NOT cheap.
I ran tests to verify this. On a Core i5 M460 (64-bit), the fastest access was to 32-bit-aligned data. 64-bit alignment was slightly slower, but almost the same. 16-bit and 8-bit alignment were both noticeably slower than 32- and 64-bit alignment, with 16-bit slower than 8-bit. By far the slowest form of access was unaligned 32-bit access: 3.5 times slower than aligned 32-bit access (the fastest of them all), and even 40% slower than unaligned 64-bit access.
Results: https://github.com/mkschreder/align-test/blob/master/results-i5-64bit.jpg?raw=true
Source code: https://github.com/mkschreder/align-test
Aligned loads and stores are faster; two excerpts from the Intel Optimization Manual make this point cleanly:
3.6 OPTIMIZING MEMORY ACCESSES
Align data, paying attention to data layout and stack alignment
...
Alignment and forwarding problems are among the most common sources of large delays on processors based on Intel NetBurst microarchitecture.
AND
3.6.4 Alignment
Alignment of data concerns all kinds of variables:
• Dynamically allocated variables
• Members of a data structure
• Global or local variables
• Parameters passed on the stack
Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits.
Following that part in 3.6.4, there is a nice rule for compiler developers:
Assembly/Compiler Coding Rule 45. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.
followed by a listing of alignment rules and another gem in 3.6.6
User/Source Coding Rule 6. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.
Both rules are marked as high impact, meaning they can greatly change performance. Along with these excerpts, the rest of Section 3.6 is filled with other reasons to naturally align your data. It's well worth any developer's time to read these manuals, if only to understand the hardware he/she is working on.
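Here is what Rule 6 looks like in practice. The struct names and layouts below are my own illustration; the stated offsets assume the common x86-64 ABIs, where each integer type's alignment equals its size:

```c
#include <stddef.h>
#include <stdint.h>

/* Each member is placed at a multiple of its natural alignment, so the
   compiler inserts padding after `flag` and after `tag`. */
struct Padded {
    uint8_t  flag;   /* offset 0, followed by 3 bytes of padding */
    uint32_t count;  /* offset 4 */
    uint8_t  tag;    /* offset 8, followed by 7 bytes of padding */
    uint64_t total;  /* offset 16 */
};                   /* sizeof == 24 on these ABIs */

/* Ordering members largest-first keeps every member naturally aligned
   while leaving only tail padding. */
struct Reordered {
    uint64_t total;  /* offset 0 */
    uint32_t count;  /* offset 8 */
    uint8_t  flag;   /* offset 12 */
    uint8_t  tag;    /* offset 13, then 2 bytes of tail padding */
};                   /* sizeof == 16 */
```

You can check the offsets yourself with offsetof(); the compiler follows the "natural operand size address boundary" rule for you, at the cost of the padding bytes.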
To fix up a misaligned read, the processor needs to do two aligned reads and fix up the result. This is slower than having to do one read and no fix-ups.
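What "two aligned reads and a fix-up" means concretely can be sketched in C. This is my own illustration of the principle, assuming a little-endian machine and assuming the aligned words on either side of the target address are readable:

```c
#include <stdint.h>
#include <string.h>

/* Emulate a misaligned 32-bit load the way hardware (or a trap handler)
   effectively does it: two aligned 32-bit loads, then shift-and-merge.
   Assumes little-endian byte order, and that the bytes from the rounded-
   down address through base+8 belong to the same readable object. */
static uint32_t emulate_unaligned_load32(const unsigned char *p) {
    unsigned off = (unsigned)((uintptr_t)p & 3u);  /* bytes past boundary */
    const unsigned char *base = p - off;           /* round down to 4 bytes */
    uint32_t lo, hi;
    memcpy(&lo, base, 4);                          /* first aligned word */
    if (off == 0)
        return lo;                                 /* already aligned */
    memcpy(&hi, base + 4, 4);                      /* second aligned word */
    return (lo >> (8 * off)) | (hi << (8 * (4 - off)));
}
```

The aligned case costs one load; the misaligned case costs two loads plus shifts and an OR, which is exactly why it can never be as fast.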
The Snappy code has special reasons for exploiting unaligned access. It will work on x86_64; it won't work on architectures where unaligned access is not an option, and it will work slowly where fixing up unaligned access is a system call or a similarly expensive operation. (On DEC Alpha, there was a mechanism approximately equivalent to a system call for fixing up unaligned access, and you had to turn it on for your program.)
Using unaligned access is an informed decision that the authors of Snappy made. It does not make it sensible for everyone to emulate it. Compiler writers would be excoriated for the poor performance of their code if they used it by default, for example.
A Random Guy On The Internet I've found says that on the 486, an aligned 32-bit access takes one cycle. An unaligned 32-bit access that spans quads but stays within the same cache line takes four cycles, and one that spans multiple cache lines can take an extra six to twelve cycles.
Given that an unaligned access requires accessing multiple quads of memory, pretty much by definition, I'm not at all surprised by this. I'd imagine that better caching performance on modern processors makes the cost a little less bad, but it's still something to be avoided.
(Incidentally, if your code has any pretensions to portability... ia32 and its descendants are pretty much the only modern architectures that support unaligned accesses at all. ARM, for example, can vary between throwing an exception, emulating the access in software, or just loading the wrong value, depending on the OS!)
Update: Here's someone who actually went and measured it. On his hardware he reckons unaligned access to be half as fast as aligned. Go try it for yourself...
Unaligned loads/stores should never be used, but the reason is not performance. The reason is that the C language forbids them (both via the alignment rules and the aliasing rules), and they don't work on many systems without extremely slow emulation code - code which may also break the C11 memory model needed for proper behavior of multi-threaded code, unless it's done on a purely byte-by-byte level.
As for x86 and x86_64, misaligned loads and stores are allowed for most operations (except some SSE instructions), but that doesn't mean they're as fast as correct accesses. It just means the CPU does the emulation for you, and does it somewhat more efficiently than you could do yourself. As an example, a memcpy-type loop that does misaligned word-size reads and writes will be moderately slower than the same memcpy doing aligned accesses, but it will still be faster than a hand-written byte-by-byte copy loop.
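The byte-level escape hatch this answer alludes to is the memcpy idiom: copying through memcpy is well-defined C at any alignment (no alignment or aliasing violation), and on x86-64 a decent compiler lowers these helpers to single mov instructions. The helper names are mine:

```c
#include <stdint.h>
#include <string.h>

/* Well-defined unaligned load: the bytes are copied, so no misaligned
   uint32_t object is ever accessed directly. Compilers for targets with
   cheap unaligned access typically emit a single mov for each of these. */
static inline uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store_u32(void *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```

This is how you get the Snappy-style speed of word-sized accesses without writing code the C standard forbids.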