Are loads of variables that are aligned on word boundaries faster than unaligned loads on x86-64 (Intel/AMD 64-bit) processors?
A colleague of mine argues
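To fix up a misaligned read, the processor needs to do two aligned reads and merge the results. This is slower than doing one read with no fix-up. On recent x86-64 cores the extra cost is mostly paid when the access straddles a cache-line boundary; a misaligned read that stays within one cache line is usually about as fast as an aligned one.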
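To make "two aligned reads and a fix-up" concrete, here is a minimal software sketch of the idea. It is only an illustration, not how any particular processor implements it in hardware; the helper name `load32_via_aligned` is made up for this example, and it assumes little-endian byte order and that the word after the one being read is also readable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from any library: emulate a 32-bit load at byte
   offset `off` using only aligned 32-bit reads from `words`.
   Assumes little-endian byte order and that words[off / 4 + 1] is readable
   whenever off is not a multiple of 4. */
static uint32_t load32_via_aligned(const uint32_t *words, size_t off)
{
    size_t idx = off / 4;                        /* aligned word holding the first byte */
    unsigned shift = (unsigned)(off % 4) * 8;    /* misalignment in bits: 0, 8, 16 or 24 */
    uint32_t lo = words[idx];                    /* first aligned read */

    if (shift == 0)
        return lo;                               /* already aligned: one read, no fix-up */

    uint32_t hi = words[idx + 1];                /* second aligned read */
    return (lo >> shift) | (hi << (32 - shift)); /* fix-up: merge the two pieces */
}
```

With `words = {0x44332211, 0x88776655}`, an offset of 1 yields `0x55443322`: three bytes come from the first word and one from the second, which is the extra work the aligned case avoids.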
The Snappy code has special reasons for exploiting unaligned access. It will work on x86_64; it won't work on architectures where unaligned access is not an option, and it will work slowly where fixing up unaligned access is a system call or a similarly expensive operation. (On DEC Alpha, there was a mechanism approximately equivalent to a system call for fixing up unaligned access, and you had to turn it on for your program.)
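If you do need to read from a possibly misaligned address in your own code, one common portable idiom is to copy the bytes with `memcpy` instead of dereferencing a cast pointer. The sketch below is not taken from Snappy; `load_u32_unaligned` is a hypothetical name. On x86-64, optimizing compilers typically turn this into a single load instruction, while on strict-alignment targets they emit whatever sequence is safe there.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from an address that may not be
   4-byte aligned. memcpy has no alignment requirement, so this is defined
   behaviour everywhere; the result is in the host's native byte order. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* byte-wise copy, valid for any address */
    return v;
}
```

Calling `load_u32_unaligned(buf + 1)` on a byte buffer reads the four bytes starting at `buf[1]`, regardless of how `buf` itself is aligned.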
Using unaligned access is an informed decision that the authors of Snappy made. That does not make it sensible for everyone to emulate. Compiler writers, for example, would be excoriated for the poor performance of their generated code if they used it by default.