Faster 64-bit/32-bit division algorithm for ARM / NEON?

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken in 32 bits. Together, these two places are taking more

1 answer

生来不讨喜

2020-12-17 00:14
I did a lot of fixed-point arithmetic in the past and spent a lot of time looking for fast 64/32-bit divisions myself. If you google for 'ARM division' you will find tons of great links and discussions about this issue.

The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:

http://www.peter-teichmann.de/adiv2e.html

This assembly code is very old, and your assembler may not understand its syntax. It is, however, worth porting the code to your toolchain. It is the fastest division code for your special case I've seen so far, and trust me: I've benchmarked them all :-)

The last time I did that (about 5 years ago, for a Cortex-A8), this code was about 10 times faster than what the compiler generated.

This code doesn't use NEON. A NEON port would be interesting. Not sure if it will improve the performance much though.

Edit:

I found the code with the assembler ported to GAS (GNU toolchain). This code works and has been tested:

Divide.S
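```
.section ".text"

.global udiv64

@ uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor)
@ in:  r0 = low word, r1 = high word of the dividend, r2 = divisor
@ out: r0 = 32-bit quotient (r1 holds the remainder)
@ The quotient must fit in 32 bits, i.e. r1 < r2 on entry.
udiv64:
    adds      r0,r0,r0        @ shift r1:r0 left by one bit
    adc       r1,r1,r1

    .rept 31
        cmp     r1,r2         @ carry set if r1 >= r2
        subcs   r1,r1,r2      @ if so, subtract the divisor
        adcs    r0,r0,r0      @ shift the quotient bit into r0 ...
        adc     r1,r1,r1      @ ... and the top bit of r0 into r1
    .endr

    cmp     r1,r2             @ last step: no further shift of r1
    subcs   r1,r1,r2
    adcs    r0,r0,r0

    bx      lr
```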
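For readers who want to check a port of the assembly, the unrolled loop is a shift-and-subtract (restoring) division. Here is a minimal portable sketch of the same steps in C. The name `udiv64_model` is made up for illustration and is not part of the original code. It assumes `hi < d` (so the quotient fits in 32 bits) and `d <= 0x80000000` (so the partial remainder never loses a bit when shifted).

```c
#include <stdint.h>

/* Hypothetical portable model of udiv64: divides the 64-bit value hi:lo
   by d and returns the 32-bit quotient.
   Assumes hi < d and d <= 0x80000000. */
uint32_t udiv64_model(uint32_t lo, uint32_t hi, uint32_t d)
{
    for (int i = 0; i < 32; i++) {
        /* shift the 64-bit pair hi:lo left by one bit */
        hi = (hi << 1) | (lo >> 31);
        lo <<= 1;
        /* if the partial remainder is at least d, subtract and record a 1 bit */
        if (hi >= d) {
            hi -= d;
            lo |= 1u;
        }
    }
    return lo; /* lo now holds the quotient, hi the remainder */
}
```

A model like this will be much slower than the assembly, but it can be used on a host machine to compare results for a set of test inputs.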
C-Code:
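```
#include <stdint.h>

/* udiv64 is implemented in Divide.S; C linkage is needed when compiling as C++ */
#ifdef __cplusplus
extern "C"
#endif
uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor);

int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with 64 bit intermediate result */
/* the quotient must fit in 32 bits, i.e. (|a| >> 8) < |b| */
{
  int sign = (a^b) < 0; /* different signs */
  /* take magnitudes in unsigned arithmetic so INT32_MIN does not overflow */
  uint32_t ua = a<0 ? 0u - (uint32_t)a : (uint32_t)a;
  uint32_t ub = b<0 ? 0u - (uint32_t)b : (uint32_t)b;
  uint32_t l = ua << 24;  /* low word of |a| << 24 */
  uint32_t h = ua >> 8;   /* high word of |a| << 24 */
  uint32_t q = udiv64 (l,h,ub);
  return (int32_t)(sign ? 0u - q : q);
}
```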
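To sanity-check `fixdiv24` on a host machine, one option is to compare it against the compiler's own 64-bit division. The sketch below is an assumption-laden reference, not part of the original answer: `fixdiv24_ref` is a made-up name, and it assumes `b != 0` and that the quotient fits in 32 bits. Like `fixdiv24`, it truncates toward zero.

```c
#include <stdint.h>

/* Hypothetical reference for fixdiv24: computes (a << 24) / b using the
   compiler's 64-bit division. Assumes b != 0 and that the result fits
   in an int32_t. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    /* multiply instead of shifting so negative values of a are well defined */
    int64_t num = (int64_t)a * (1 << 24);
    return (int32_t)(num / b); /* C division truncates toward zero */
}
```

With 24 fractional bits, 1.0 is represented as `1 << 24`, so `fixdiv24_ref(1 << 24, 2 << 24)` divides 1.0 by 2.0 and yields `1 << 23`, which is 0.5.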