I am working on a code in which at two places there are 64bit by 32 bit fixed point division and the result is taken in 32 bits. These two places are together taking more
I did a lot of fixed-point arithmetic in the past and did a lot of research looking for fast 64/32 bit divisions myself. If you google for 'ARM division' you will find tons of great links and discussion about this issue.
The best solution for ARM architecture, where even a 32 bit division may not be available in hardware is here:
http://www.peter-teichmann.de/adiv2e.html
This assembly code is very old, and your assembler may not understand the syntax of it. It is however worth porting the code to your toolchain. It is the fastest division code for your special case I've seen so far, and trust me: I've benchmarked them all :-)
Last time I did that (about 5 years ago, for CortexA8) this code was about 10 times faster than what the compiler generated.
This code doesn't use NEON. A NEON port would be interesting. Not sure if it will improve the performance much though.
Edit:
I found the code with assembler ported to GAS (GNU Toolchain). This code is working and tested:
Divide.S
.section ".text"
.global udiv64
udiv64:
adds r0,r0,r0
adc r1,r1,r1
.rept 31
cmp r1,r2
subcs r1,r1,r2
adcs r0,r0,r0
adc r1,r1,r1
.endr
cmp r1,r2
subcs r1,r1,r2
adcs r0,r0,r0
bx lr
C-Code:
extern "C" uint32_t udiv64 (uint32_t a, uint32_t b, uint32_t c);
int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with 64 bit immediate result */
{
int q;
int sign = (a^b) < 0; /* different signs */
uint32_t l,h;
a = a<0 ? -a:a;
b = b<0 ? -b:b;
l = (a << 24);
h = (a >> 8);
q = udiv64 (l,h,b);
if (sign) q = -q;
return q;
}