How could NEON be as slow as C?
I have been trying to build a fast histogram function that buckets incoming values into ranges by assigning each one a value: the range threshold it is closest to. This will be applied to images, so it has to be fast (assume an image of 640x480, so roughly 300,000 elements). The histogram range thresholds are multiples (0, 25, 50, 75, 100). Inputs are floats and the final outputs are, obviously, integers.
I tested the following versions in Xcode by opening a new empty project (no app delegate) and just using the main.m file. I removed all linked libraries except Accelerate.
Here is the C implementation. An older version was full of if/then branches, but here is the final optimized logic. It took 11.3 s.
#import <Foundation/Foundation.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    NSLog(@"starting");
    int sizeOfArray=300000;
    float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
    int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray);
    for (int i=0; i<sizeOfArray; ++i)
    {
        inputArray[i]=88.5;
    }
    //Assume the range is [0,25,50,75,100]
    int lcd=25;
    for (int j=0; j<1000; ++j) //repeat to get a measurable time interval
    {
        for (int i=0; i<sizeOfArray; ++i)
        {
            //a 60.5 input gives 50; an 88.5 input gives 100
            outputArray[i]=roundf(inputArray[i]/lcd)*lcd;
        }
    }
    NSLog(@"done");
    return 0;
}
Here is the vDSP implementation. Even with some tedious float-to-integer round trips, it took only 6 s: almost a 50% improvement!
//vDSP implementation
#import <Foundation/Foundation.h>
#include <Accelerate/Accelerate.h>

int main(int argc, char *argv[])
{
    NSLog(@"starting");
    int sizeOfArray=300000;
    float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
    float* outputArrayF=(float*) malloc(sizeof(float)*sizeOfArray); //vDSP requires input and output types to match
    int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray); //value rounded to the nearest integer
    float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
    int* finalOutputArray=(int*) malloc(sizeof(int)*sizeOfArray); //integer output, to compare apples to apples
    for (int i=0; i<sizeOfArray; ++i)
    {
        inputArray[i]=37.0; //this will produce a final number of 25; 37.5, on the other hand, would produce 50
    }
    for (int j=0; j<1000; ++j) //repeat to get a measurable time interval
    {
        //Assume the range is [0,25,50,75,100]
        float lcd=25.0f;
        //divide by lcd
        vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1, sizeOfArray);
        //round to the nearest integer
        vDSP_vfixr32(outputArrayF, 1, outputArray, 1, sizeOfArray);
        //MUST convert int back to float (a plain cast of the buffer won't do), then multiply by the scalar.
        //This step has the effect of rounding each number to the nearest lcd.
        vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
        vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
        vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
    }
    NSLog(@"done");
    return 0;
}
Here is the NEON implementation. This is my first, so play nice! It was slower than vDSP, taking 9.3 s, which did not make sense to me. Either vDSP is better optimized than NEON or I am doing something wrong.
//NEON implementation
#import <Foundation/Foundation.h>
#include <arm_neon.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    NSLog(@"starting");
    int sizeOfArray=300000;
    float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
    float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
    for (int i=0; i<sizeOfArray; ++i)
    {
        inputArray[i]=37.0; //this will produce a final number of 25; 37.5, on the other hand, would produce 50
    }
    for (int j=0; j<1000; ++j) //repeat to get a measurable time interval
    {
        float32x4_t c0,c1,c2,c3;
        uint32x4_t e0,e1,e2,e3; //vcgtq_f32 returns unsigned bit masks, not floats
        float32x4_t f0,f1,f2,f3;
        //ranges of histogram buckets
        float32x4_t buckets0=vdupq_n_f32(0);
        float32x4_t buckets1=vdupq_n_f32(25);
        float32x4_t buckets2=vdupq_n_f32(50);
        float32x4_t buckets3=vdupq_n_f32(75);
        float32x4_t buckets4=vdupq_n_f32(100);
        //midpoints of the ranges
        float32x4_t thresholds1=vdupq_n_f32(12.5);
        float32x4_t thresholds2=vdupq_n_f32(37.5);
        float32x4_t thresholds3=vdupq_n_f32(62.5);
        float32x4_t thresholds4=vdupq_n_f32(87.5);
        for (int i=0; i<sizeOfArray; i+=16)
        {
            c0= vld1q_f32(&inputArray[i]);    //load
            c1= vld1q_f32(&inputArray[i+4]);  //load
            c2= vld1q_f32(&inputArray[i+8]);  //load
            c3= vld1q_f32(&inputArray[i+12]); //load
            f0=buckets0;
            f1=buckets0;
            f2=buckets0;
            f3=buckets0;
            //register0
            e0=vcgtq_f32(c0,thresholds1);
            f0=vbslq_f32(e0, buckets1, f0);
            e0=vcgtq_f32(c0,thresholds2);
            f0=vbslq_f32(e0, buckets2, f0);
            e0=vcgtq_f32(c0,thresholds3);
            f0=vbslq_f32(e0, buckets3, f0);
            e0=vcgtq_f32(c0,thresholds4);
            f0=vbslq_f32(e0, buckets4, f0);
            //register1
            e1=vcgtq_f32(c1,thresholds1);
            f1=vbslq_f32(e1, buckets1, f1);
            e1=vcgtq_f32(c1,thresholds2);
            f1=vbslq_f32(e1, buckets2, f1);
            e1=vcgtq_f32(c1,thresholds3);
            f1=vbslq_f32(e1, buckets3, f1);
            e1=vcgtq_f32(c1,thresholds4);
            f1=vbslq_f32(e1, buckets4, f1);
            //register2
            e2=vcgtq_f32(c2,thresholds1);
            f2=vbslq_f32(e2, buckets1, f2);
            e2=vcgtq_f32(c2,thresholds2);
            f2=vbslq_f32(e2, buckets2, f2);
            e2=vcgtq_f32(c2,thresholds3);
            f2=vbslq_f32(e2, buckets3, f2);
            e2=vcgtq_f32(c2,thresholds4);
            f2=vbslq_f32(e2, buckets4, f2);
            //register3
            e3=vcgtq_f32(c3,thresholds1);
            f3=vbslq_f32(e3, buckets1, f3);
            e3=vcgtq_f32(c3,thresholds2);
            f3=vbslq_f32(e3, buckets2, f3);
            e3=vcgtq_f32(c3,thresholds3);
            f3=vbslq_f32(e3, buckets3, f3);
            e3=vcgtq_f32(c3,thresholds4);
            f3=vbslq_f32(e3, buckets4, f3);
            vst1q_f32(&finalOutputArrayF[i], f0);
            vst1q_f32(&finalOutputArrayF[i+4], f1);
            vst1q_f32(&finalOutputArrayF[i+8], f2);
            vst1q_f32(&finalOutputArrayF[i+12], f3);
        }
    }
    NSLog(@"done");
    return 0;
}
PS: This is my first benchmarking at this scale, so I tried to keep it simple (large loops, constant setup code, using NSLog to print start/end times, only the Accelerate framework linked). If any of these assumptions significantly impact the outcome, please critique.
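For reference, here is a minimal sketch of a finer-grained timer that could replace the NSLog bracketing, using mach_absolute_time (the helper name elapsedSeconds is hypothetical; the timebase conversion is the standard Mach pattern):

#include <mach/mach_time.h>

//Hypothetical helper: converts two mach_absolute_time() readings
//into elapsed seconds using the Mach timebase.
static double elapsedSeconds(uint64_t start, uint64_t end)
{
    mach_timebase_info_data_t info;
    mach_timebase_info(&info); //ratio converting mach ticks to nanoseconds
    return (double)(end - start) * info.numer / info.denom / 1e9;
}

//Usage around the 1000-iteration loop:
//uint64_t t0 = mach_absolute_time();
//... benchmark loop ...
//uint64_t t1 = mach_absolute_time();
//NSLog(@"took %.3f s", elapsedSeconds(t0, t1));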
Thanks
First, this is not "NEON" per se. This is intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembler.
vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON processor. vDSP's use of the NEON is much better optimized than your use of the NEON.
I haven't dug through your intrinsics code yet, but the most likely (in fact almost certain) cause of trouble is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on) is nothing like writing in C. You don't loop the same. You don't compare the same. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things in such a way that all those things can run in parallel. Good assembly keeps all those pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to carefully consider what is going into which register and how many cycles must pass before you're allowed to read it.
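To make that concrete, here is a sketch of what interleaving might look like, reusing the c0/c1, e0/e1, f0/f1, buckets and thresholds names from the question's loop (not tuned code, just the shape of the scheduling):

//Interleave the two independent chains so that the compare for one
//vector can issue while the select for the other is still in flight,
//instead of finishing all four steps for c0 before touching c1.
f0 = buckets0;
f1 = buckets0;
e0 = vcgtq_f32(c0, thresholds1);
e1 = vcgtq_f32(c1, thresholds1);
f0 = vbslq_f32(e0, buckets1, f0);
f1 = vbslq_f32(e1, buckets1, f1);
e0 = vcgtq_f32(c0, thresholds2);
e1 = vcgtq_f32(c1, thresholds2);
f0 = vbslq_f32(e0, buckets2, f0);
f1 = vbslq_f32(e1, buckets2, f1);
//... same pattern for thresholds3 and thresholds4 ...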
If it were as easy as transliterating C, then the compiler would do that for you. The moment you say "I'm going to write this in NEON" you're saying "I think I can write better NEON than the compiler," because the compiler uses it too. That said, it often is possible to write better NEON than the compiler (particularly gcc and clang).
If you're ready to go diving into that world (and it's a pretty cool world), you have some reading ahead of you. Here are some places I recommend:
- http://www.coranac.com/tonc/text/asm.htm (You really want to spend some time with this one)
- http://hilbert-space.de/ (The whole site. He stopped writing way too soon.)
- To your specific question, he explains it here: http://hilbert-space.de/?p=22
- Some more on your specific question: Arm Neon Intrinsics vs hand assembly
- http://wanderingcoder.net/2010/07/19/ought-arm/
- http://wanderingcoder.net/2010/06/02/intro-neon/
ALL THAT SAID... Always always always start by reconsidering your algorithm. Often the answer is not how to make your loop calculate quickly, it's how to not call the loop so often.
ARM NEON has 32 registers, each 64 bits wide (with a dual view as 16 registers, each 128 bits wide). Your NEON implementation already keeps at least 18 128-bit values live, so the compiler has to generate code to move them back and forth to the stack, and that's not good: too much extra memory access.
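For illustration, a sketch (assuming the same buckets/thresholds constants and arrays as in the question) that processes one quad register per iteration keeps only about 11 q registers live, which fits in the 16 available:

//Live values: 5 buckets + 4 thresholds + the input c + the
//accumulator f = roughly 11 q registers, so nothing spills.
for (int i = 0; i < sizeOfArray; i += 4)
{
    float32x4_t c = vld1q_f32(&inputArray[i]);
    float32x4_t f = buckets0;
    f = vbslq_f32(vcgtq_f32(c, thresholds1), buckets1, f);
    f = vbslq_f32(vcgtq_f32(c, thresholds2), buckets2, f);
    f = vbslq_f32(vcgtq_f32(c, thresholds3), buckets3, f);
    f = vbslq_f32(vcgtq_f32(c, thresholds4), buckets4, f);
    vst1q_f32(&finalOutputArrayF[i], f);
}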
If you plan to play with assembly, I found it best to use a tool to dump the instructions in object files: objdump -d on Linux, and I believe it is otool -tv in the Apple world. This way you can see what the resulting machine code actually looks like, and what the compiler did with your functions.
Below is part of the dump of your NEON implementation produced by gcc 4.7.1 with -O3. Notice the quad register being reloaded from the stack via vldmia sp, {d8-d9}:
1a6: ff24 cee8 vcgt.f32 q6, q10, q12
1aa: ff64 4ec8 vcgt.f32 q10, q10, q4
1ae: ff2e a1dc vbit q5, q15, q6
1b2: ff22 ceea vcgt.f32 q6, q9, q13
1b6: ff5c 41da vbsl q10, q14, q5
1ba: ff20 aeea vcgt.f32 q5, q8, q13
1be: f942 4a8d vst1.32 {d20-d21}, [r2]!
1c2: ec9d 8b04 vldmia sp, {d8-d9}
1c6: ff62 4ee8 vcgt.f32 q10, q9, q12
1ca: f942 6a8f vst1.32 {d22-d23}, [r2]
Of course this all depends on the compiler; a better compiler could avoid this situation by using the available registers more cleverly.
So in the end, you are at the mercy of the compiler unless you write assembly (inline or standalone), or you must continuously check the compiler's output until you get what you want from it.
As a complement to Rob's answer (that writing NEON is an art in and of itself; thanks for plugging my Wandering Coder posts, by the way) and auselen's answer (that you are indeed keeping too many registers live at any given time, leading to spilling), I should add that your intrinsics algorithm is more general than the other two: it allows arbitrary ranges, not just multiples, so you are attempting to compare things that are not comparable. Always compare oranges to oranges; the one exception is that it is fair game to compare a custom algorithm, more specific than an off-the-shelf generic one, when you only need the specific features of the custom one. So that is another way a NEON algorithm can be as slow as a C one: if they are not the same algorithm.
As for your histogramming needs, use what you've built with vDSP for the time being; only if the performance is not satisfying for your application should you investigate optimizing another way. Avenues to do so, besides using NEON instructions, include avoiding so much memory movement (likely the bottleneck in the vDSP implementation) and incrementing the counters for each bucket as you browse the pixels, instead of producing this intermediate output made of coerced values. Efficient DSP code is not just about the calculations themselves, but also about how efficiently you use memory bandwidth and so forth. Even more so on mobile: memory I/O, even to the caches, is more power-intensive than in-processor-core operations, so the memory I/O buses tend to run at a lower fraction of the processor clock speed. You don't have that much memory bandwidth to play with, and you should use what you do have wisely, because any use of it takes power.
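A minimal scalar sketch of that counting approach (assuming inputs stay in [0,100]; anything outside would need clamping before indexing):

#include <math.h>

//Single pass, no intermediate arrays: counts[k] tallies the pixels
//whose nearest threshold is k*25.
int counts[5] = {0, 0, 0, 0, 0};
const float lcd = 25.0f;
for (int i = 0; i < sizeOfArray; ++i)
{
    int bucket = (int)roundf(inputArray[i] / lcd); //0..4
    counts[bucket]++;
}
//If the coerced value is ever needed, it is simply bucket * lcd.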
Source: https://stackoverflow.com/questions/14926351/c-versus-vdsp-versus-neon-how-could-neon-be-as-slow-as-c