It used to be that ARM processors were unable to properly handle unaligned memory access (ARMv5 and below). Something like u32 var32 = *(u32*)ptr; would just fail (on those cores an unaligned load would either fault or silently return a rotated word rather than the expected value).
Part of the issue is likely that you are preventing inlining and further optimization. With the load in a separately compiled function, the compiler may emit an actual function call at each use, which can hurt performance.
One thing you might do is declare the function static inline, which allows the compiler to inline load32() at each call site and eliminate the call overhead. However, at higher optimization levels the compiler should already be doing this for you.
If the compiler inlines a 4-byte memcpy(), it will likely transform it into the most efficient sequence of loads or stores that still works on unaligned boundaries. So if you are still seeing low performance even with compiler optimizations enabled, that may simply be the maximum performance for unaligned reads and writes on the processors you are using. Since you said "__packed instructions" are yielding identical performance to memcpy(), this would seem to be the case.
At this point, there is very little you can do except align your data. However, if you are dealing with a contiguous array of unaligned u32s, there is one thing you could try:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// get an aligned copy of an array of n unaligned u32 values
uint32_t *align32 (const void *p, size_t n) {
    uint32_t *r = malloc (n * sizeof (uint32_t));
    if (r)
        memcpy (r, p, n * sizeof (uint32_t));
    return r;
}
This just allocates a new array using malloc(), because malloc() and friends allocate memory with correct alignment for everything:
The malloc() and calloc() functions return a pointer to the allocated memory that is suitably aligned for any kind of variable.
- malloc(3), Linux Programmer's Manual
This should be relatively fast, as you only have to do it once per set of data. While copying, memcpy() only has to adjust for the initial misalignment and can then use the fastest aligned load and store instructions available, after which you can access your data with ordinary aligned reads and writes at full performance.