I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line.
for(i = 0; i < h; i++) {
memcpy(d, s, w);
s += sp;
d += dp;
}
I know that the following
d, dp, s, sp, w
are all 32-byte aligned, so my next (still quite naive) implementation was along the lines of
for (int i = 0; i < h; i++) {
uint8_t* dst = d;
const uint8_t* src = s;
int remaining = w;
asm volatile (
"1: \n"
"subs %[rem], %[rem], #32 \n"
"vld1.u8 {d0, d1, d2, d3}, [%[src],:256]! \n"
"vst1.u8 {d0, d1, d2, d3}, [%[dst],:256]! \n"
"bgt 1b \n"
: [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
:
: "d0", "d1", "d2", "d3", "cc", "memory"
);
d += dp;
s += sp;
}
Which was ~150% faster than memcpy over a large number of iterations (on different images, so not taking advantage of caching). I feel like this should be nowhere near the optimum because I am yet to use preloading, but when I do I only seem to be able to make performance substantially worse. Does anyone have any insight here?
ARM has a great tech note on this.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
Your performance will definitely vary depending on the micro-architecture, ARM's note is on the A8 but I think it will give you a decent idea, and the summary at the bottom is a great discussion of the various pros and cons that go beyond just the regular numbers, such as which methods result in the least amount of register usage, etc.
And yes, as another commenter mentions, pre-fetching is very difficult to get right, and will work differently with different micro-architectures, depending on how big the caches are and how big each line is and a bunch of other details about the cache design. You can end up thrashing lines you need if you aren't careful. I would recommend avoiding it for portable code.
来源:https://stackoverflow.com/questions/11161237/fast-arm-neon-memcpy