I'm working on Cortex-A8 and Cortex-A9 in particular. I know that some architectures don't come with integer division, but what is the best way to do it other than convert to float, divide, convert to integer? Or is that indeed the best solution?
Cheers! = )
The compiler normally includes a divide in its library, gcclib for example I have extracted them from gcc and use them directly:
https://github.com/dwelch67/stm32vld/ then stm32f4d/adventure/gcclib
going to float and back is probably not the best solution. you can try it and see how fast it is...This is a multiply but could as easily make it a divide:
https://github.com/dwelch67/stm32vld/ then stm32f4d/float01/vectors.s
I didnt time it though to see how fast/slow. Understood I am using a cortex-m above and you are talking about a cortex-a, different ends of the spectrum, similar float instructions and the gcc lib stuff is similar, for the cortex-m I have to build for thumb but you can just as easily build for arm. Actually with gcc it should all just work automagically you should not need to do it the way I did it. Other compilers as well you should not need to do it the way I did it in the adventure game above.
Division by a constant value is done quickly by doing a 64bit-multiply and shift-right, for example, like this:
LDR R3, =0xA151C331
UMULL R3, R2, R1, R3
MOV R0, R2,LSR#10
here R1 is divided by 1625. The calculation is done like this: 64bitreg(R2:R3) = R1*0xA151C331, then the result is the upper 32bit right shifted by 10:
R1*0xA151C331/2^(32+10) = R1*0.00061538461545751488 = R1/1624.99999980
You can calculate your own constants from this formula:
x / N == (x*A)/2^(32+n) --> A = 2^(32+n)/N
select the largest n, for which A < 2^32
Some copy-pasta from elsewhere for an integer divide: Basically, 3 instructions per bit. From this website, though I've seen it many other places as well. This site also has a nice version which may be faster in general.
@ Entry r0: numerator (lo) must be signed positive
@ r2: deniminator (den) must be non-zero and signed negative
idiv:
lo .req r0; hi .req r1; den .req r2
mov hi, #0 @ hi = 0
adds lo, lo, lo
.rept 32 @ repeat 32 times
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
.endr
mov pc, lr @ return
@ Exit r0: quotient (lo)
@ r1: remainder (hi)
I wrote my own routine to perform an unsigned division as I could not find an unsigned version on the web. I needed to divide a 64 bit value with a 32 bit value to get a 32 bit result.
The inner loop is not as efficient as the signed solution provided above, but this does support unsigned arithmetic. This routine performs a 32 bit division if the high part of the numerator (hi) is smaller than the denominator (den), otherwise a full 64 bit division is performed (hi:lo/den). The result is in lo.
cmp hi, den // if hi < den do 32 bits, else 64 bits
bpl do64bits
REPT 32
adds lo, lo, lo // shift numerator through carry
adcs hi, hi, hi
subscc work, hi, den // if carry not set, compare
subcs hi, hi, den // if carry set, subtract
addcs lo, lo, #1 // if carry set, and 1 to quotient
ENDR
mov r0, lo // move result into R0
mov pc, lr // return
do64bits:
mov top, #0
REPT 64
adds lo, lo, lo // shift numerator through carry
adcs hi, hi, hi
adcs top, top, top
subscc work, top, den // if carry not set, compare
subcs top, top, den // if carry set, subtract
addcs lo, lo, #1 // if carry set, and 1 to quotient
ENDR
mov r0, lo // move result into R0
mov pc, lr // return
Extra checking for boundary conditions and power of 2 can be added. Full details can be found at http://www.idwiz.co.za/Tips%20and%20Tricks/Divide.htm
I wrote the following functions for the ARM GNU
assembler. If you don't have a CPU with udiv/sdiv
machine support, just cut out the first few lines up to the "0:" label in either function.
.arm
.cpu cortex-a7
.syntax unified
.type udiv,%function
.globl udiv
udiv: tst r1,r1
bne 0f
udiv r3,r0,r2
mls r1,r2,r3,r0
mov r0,r3
bx lr
0: cmp r1,r2
movhs r1,r2
bxhs lr
mvn r3,0
1: adds r0,r0
adcs r1,r1
cmpcc r1,r2
subcs r1,r2
orrcs r0,1
lsls r3,1
bne 1b
bx lr
.size udiv,.-udiv
.type sdiv,%function
.globl sdiv
sdiv: teq r1,r0,ASR 31
bne 0f
sdiv r3,r0,r2
mls r1,r2,r3,r0
mov r0,r3
bx lr
0: mov r3,2
adds r0,r0
and r3,r3,r1,LSR 30
adcs r1,r1
orr r3,r3,r2,LSR 31
movvs r1,r2
ldrvc pc,[pc,r3,LSL 2]
bx lr
.int 1f
.int 3f
.int 5f
.int 11f
1: cmp r1,r2
movge r1,r2
bxge lr
mvn r3,1
2: adds r0,r0
adcs r1,r1
cmpvc r1,r2
subge r1,r2
orrge r0,1
lsls r3,1
bne 2b
bx lr
3: cmn r1,r2
movge r1,r2
bxge lr
mvn r3,1
4: adds r0,r0
adcs r1,r1
cmnvc r1,r2
addge r1,r2
orrge r0,1
lsls r3,1
bne 4b
rsb r0,0
bx lr
5: cmn r1,r2
blt 6f
tsteq r0,r0
bne 7f
6: mov r1,r2
bx lr
7: mvn r3,1
8: adds r0,r0
adcs r1,r1
cmnvc r1,r2
blt 9f
tsteq r0,r3
bne 10f
9: add r1,r2
orr r0,1
10: lsls r3,1
bne 8b
rsb r0,0
bx lr
11: cmp r1,r2
blt 12f
tsteq r0,r0
bne 13f
12: mov r1,r2
bx lr
13: mvn r3,1
14: adds r0,r0
adcs r1,r1
cmpvc r1,r2
blt 15f
tsteq r0,r3
bne 16f
15: sub r1,r2
orr r0,1
16: lsls r3,1
bne 14b
bx lr
There are two functions, udiv
for unsigned integer division and sdiv
for signed integer division. They both expect a 64-bit dividend (either signed or unsigned) in r1
(high word) and r0
(low word), and a 32-bit divisor in r2
. They return the quotient in r0
and the remainder in r1
, thus you can define them in a C header
as extern
returning a 64-bit integer and mask out the quotient and remainder afterwards. An error (division by 0 or overflow) is indicated by a remainder having an absolute value greater than or equal the absolute value of the divisor. The signed division algorithm uses case distinction by the signs of both dividend and divisor; it does not convert to positive integers first, since that wouldn't detect all overflow conditions properly.
来源:https://stackoverflow.com/questions/8348030/how-does-one-do-integer-signed-or-unsigned-division-on-arm