I\'m coding a simple primality tester program for Windows in x86 assembly language (MASM32), which involves calculating a square root of a (64-bit) integer. My question is:
There are numerous algorithms and techniques for calculating a square root of a number. In fact calculating the square root using Newton-Raphson's method is a standard assignment for Numeric Analysis students, as an example of the general case of finding the root of an equation.
Without profiling and benchmarking the code, and knowing whether you need a single or multiple square roots (SIMD calculations such as via SSE/SSE2), I would suggest you start with @Smi's answer, which uses the x87 FPU FSQRT
instruction, as your baseline implementation. This does incur a load-store hit (quick summary: moving between FPU and CPU's ALU breaks caching and pipelines) which may negate the advantage of using the built-in FPU instruction.
Since you mentioned prime testing, I'm guessing that the sqrt is only used once per candidate number to determine the search range (any non-trivial factors are between 2 <= f <= sqrt(n), where n is the number being tested for primeness). If you are only testing specific numbers for primality it's okay, but for search lots of numbers you do square root for each candidate. If you are doing a "classic" test (pre- elliptic curve) it may not be worth worrying about.
I think the simplest way is to use the FPU instruction fsqrt
like this:
.data?
int64 dq ?
squareRoot dd ?
.code
fild int64 ;load the integer to ST(0)
fsqrt ;compute square root and store to ST(0)
fistp squareRoot ;store the result in memory (as a 32-bit integer) and pop ST(0)
Finally I wrote, in addition to last function for 80386+ microprocessor's, that processes a 64 Bit unsigned-integer number's square-root, a new optimized function that works for 8086+ microprocessor, in assembly.
I've tested all these routines successfully.
They works as the first routine written by me (see at bottom of this article), but are more optimized. Because is easy, I shows an example of the 16 Bit's algorithm in Pascal as follows (half-section algorithm):
Function NewSqrt(X:Word):Word;
Var L,H,M:LongInt;
Begin
L:=1;
H:=X;
M:=(X+1) ShR 1;
Repeat
If Sqr(M)>X Then
H:=M
Else
L:=M;
M:=(L+H) ShR 1;
Until H-L<2;
NewSqrt:=M;
End;
This my routine works through half-section algorithm and supports the square-root of a 64 Bit unsigned-integer number. Follows the 8086+ CPU new real-mode code:
Function NewSqrt16(XLow,XHigh:LongInt):LongInt; Assembler;
Var X0,X1,X2,X3,
H0,H1,H2,H3,
L0,L1,L2,L3,
M0,M1,M2,M3:Word;
Asm
LEA SI,XLow
Mov AX,[SS:SI]
Mov BX,[SS:SI+2]
LEA SI,XHigh
Mov CX,[SS:SI]
Mov DX,[SS:SI+2]
{------------------------
INPUT: DX:CX:BX:AX = X
OUPUT: DX:AX= Sqrt(X).
------------------------
X0 : X1 : X2 : X3 -> X
H0 : H1 : H2 : H3 -> H
L0 : L1 : L2 : L3 -> L
M0 : M1 : M2 : M3 -> M
AX , BX , CX , DX ,
ES , DI , SI -> Temp. reg.
------------------------}
Mov [X0],AX { ...}
Mov [X1],BX { ...}
Mov [X2],CX { ...}
Mov [X3],DX { Stack <- X}
Mov [H0],AX { ...}
Mov [H1],BX { ...}
Mov [H2],CX { ...}
Mov [H3],DX { Stack <- H= X}
Mov [L0],Word(1) { ...}
Mov [L1],Word(0) { ...}
Mov [L2],Word(0) { ...}
Mov [L3],Word(0) { Stack <- L= 1}
Add AX,1 { ...}
AdC Bx,0 { ...}
AdC Cx,0 { ...}
AdC Dx,0 { X= X+1}
RCR Dx,1 { ...}
RCR Cx,1 { ...}
RCR Bx,1 { ...}
RCR Ax,1 { X= (X+1)/2}
Mov [M0],AX { ...}
Mov [M1],BX { ...}
Mov [M2],CX { ...}
Mov [M3],DX { Stack <- M= (X+1)/2}
{------------------------}
@@LoopBegin: {Loop restart label}
Mov AX,[M3] {If M is more ...}
Or AX,[M2] {... then 32 bit ...}
JNE @@LoadMid {... then Square(M)>X, jump}
{DX:AX:CX:SI= 64 Bit square(Low(M))}
Mov AX,[M0] {AX <- A=Low(M)}
Mov CX,AX {CX <- A=Low(M)}
Mul AX {DX:AX <- A*A}
Mov SI,AX {SI <- Low 16 bit of last mul.}
Mov BX,DX {BX:SI <- A*A}
Mov AX,[M1] {AX <- D=High(M)}
XChg CX,AX {AX <- A=Low(M); CX <- D=High(M)}
Mul CX {DX:AX <- A*D=Low(M)*High(M)}
XOr DI,DI {...}
ShL AX,1 {...}
RCL DX,1 {...}
RCL DI,1 {DI:DX:AX <- A*D+D*A= 2*A*D (33 Bit}
Add AX,BX {...}
AdC DX,0 {...}
AdC DI,0 {DI:DX:AX:SI <- A*(D:A)+(D*A) ShL 16 (49 Bit)}
XChg CX,AX {AX <- D=High(M); CX <- Low 16 bit of last mul.}
Mov BX,DX {DI:BX:CX:SI <- A*(D:A)+(D*A) ShL 16 (49 Bit)}
Mul AX {DX:AX <- D*D}
Add AX,BX {...}
AdC DX,DI {DX:AX:CX:SI <- (D:A)*(D:A)}
{------------------------}
Cmp DX,[X3] {Compare High(Square(M)):High(X)}
@@LoadMid: {Load M in DX:BP:BX:DI}
Mov DI,[M0] {...}
Mov BX,[M1] {...}
Mov ES,[M2] {...}
Mov DX,[M3] {DX:ES:BX:DI <- M}
JA @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
JB @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}
Cmp AX,[X2] {Compare High(Square(M)):High(X)}
JA @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
JB @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}
Cmp CX,[X1] {Compare High(Square(M)):High(X)}
JA @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
JB @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}
Cmp SI,[X0] {Compare Low(Square(M)):Low(X)}
JA @@SqrMIsMoreThenX {If Low(Square(M))>Low(X) then Square(M)>X, jump}
{------------------------}
@@SqrMIsLessThenX: {Square(M)<=X}
Mov [L0],DI {...}
Mov [L1],BX {...}
Mov [L2],ES {...}
Mov [L3],DX {L= M}
Jmp @@ProcessMid {Go to process the mid value}
{------------------------}
@@SqrMIsMoreThenX: {Square(M)>X}
Mov [H0],DI {...}
Mov [H1],BX {...}
Mov [H2],ES {...}
Mov [H3],DX {H= M}
{------------------------}
@@ProcessMid: {Process the mid value}
Mov SI,[H0] {...}
Mov CX,[H1] {...}
Mov AX,[H2] {...}
Mov DX,[H3] {DX:AX:CX:SI <- H}
Mov DI,SI {DI <- H0}
Mov BX,CX {BX <- H1}
Add SI,[L0] {...}
AdC CX,[L1] {...}
AdC AX,[L2] {...}
AdC DX,[L3] {DX:AX:CX:SI <- H+L}
RCR DX,1 {...}
RCR AX,1 {...}
RCR CX,1 {...}
RCR SI,1 {DX:AX:CX:SI <- (H+L)/2}
Mov [M0],SI {...}
Mov [M1],CX {...}
Mov [M2],AX {...}
Mov [M3],DX {M <- DX:AX:CX:SI}
{------------------------}
Mov AX,[H2] {...}
Mov DX,[H3] {DX:AX:BX:DI <- H}
Sub DI,[L0] {...}
SbB BX,[L1] {...}
SbB AX,[L2] {...}
SbB DX,[L3] {DX:AX:BX:DI <- H-L}
Or BX,AX {If (H-L) >= 65536 ...}
Or BX,DX {...}
JNE @@LoopBegin {... Repeat @LoopBegin else goes forward}
Cmp DI,2 {If (H-L) >= 2 ...}
JAE @@LoopBegin {... Repeat @LoopBegin else goes forward}
{------------------------}
Mov AX,[M0] {...}
Mov DX,[M1] {@Result <- Sqrt}
End;
This function in input receives a 64 Bit number (XHigh:XLow) and return the 32 Bit square-root of it. Uses four local variables:
X, the copy of input number, subdivided in four 16 Bit packages (X3:X2:X1:X0).
H, upper limit, subdivided in four 16 Bit packages (H3:H2:H1:H0).
L, lower limit, subdivided in four 16 Bit packages (L3:L2:L1:L0).
M, mid value, subdivided in four 16 Bit packages (M3:M2:M1:M0).
Initializes the lower limit L to 1; initializes the upper limit H to input number X; initializes the mid value M to (H+1)>>1. Test if the square of M is long more then 64 Bit by verifying if (M3 | M2)!=0; if it is true then square(M)>X, sets the upper limit H to M. If it isn't true then processes the square of the lower 32 Bits of M (M1:M0) as follows:
(M1:M0)*(M1:M0)=
M0*M0+((M0*M1)<<16)+((M1*M0)<<16)+((M1*M1)<<32)=
M0*M0+((M0*M1)<<17)+((M1*M1)<<32)
If the square of lower 32 Bits of M is greater then X, sets the upper limit H to M; if the square of lower 32 Bits of M is lower or equal then X value, sets the lower limit L to M. Processes the mid value M setting it to (L+H)>>1. If (H-L)<2 then M is the square root of X, else go to test if the square of M is long more then 64 Bit and runs the follows instructions.
This my routine works through half-section algorithm and supports the square-root of a 64 Bit unsigned-integer number. Follows the 80386+ CPU new protected-mode code:
Procedure NewSqrt32(X:Int64;Var Y:Int64); Assembler;
{INPUT: EDX:EAX = X
TEMP: ECX
OUPUT: Y = Sqrt(X)}
Asm
Push EBX {Save EBX register into the stack}
Push ESI {Save ESI register into the stack}
Push EDI {Save EDI register into the stack}
{------------------------}
Mov EAX,Y {Load address of output var. into EAX reg.}
Push EAX {Save address of output var. into the stack}
{------------------------}
LEA EDX,X {Load address of input var. into EDX reg.}
Mov EAX,[EDX] { EAX <- Low 32 Bit of input value (X)}
Mov EDX,[EDX+4] {EDX:EAX <- Input value (X)}
{------------------------
[ESP+ 4]:[ESP ] -> X
EBX : ECX -> M
ESI : EDI -> L
[ESP+12]:[ESP+ 8] -> H
EAX , EDX -> Temporary registers
------------------------}
Mov EDI,1 { EDI <- Low(L)= 1}
XOr ESI,ESI { ESI <- High(L)= 0}
Mov ECX,EAX { ECX <- Low 32 Bit of input value (X)}
Mov EBX,EDX {EBX:ECX <- M= Input value (X)}
Add ECX,EDI { ECX <- Low 32 Bit of (X+1)}
AdC EBX,ESI {EBX:ECX <- M= X+1}
RCR EBX,1 { EBX <- High 32 Bit of (X+1)/2}
RCR ECX,1 {EBX:ECX <- M= (X+1)/2}
{------------------------}
Push EDX { Stack <- High(H)= High(X)}
Push EAX { Stack <- Low(H)= Low(X)}
Push EDX { Stack <- High(X)}
Push EAX { Stack <- Low(X)}
{------------------------}
@@LoopBegin: {Loop restart label}
Cmp EBX,0 {If M is more then 32 bit ...}
JNE @@SqrMIsMoreThenX {... then Square(M)>X, jump}
Mov EAX,ECX {EAX <- A= Low(M)}
Mul ECX {EDX:EAX <- 64 Bit square(Low(M))}
Cmp EDX,[ESP+4] {Compare High(Square(M)):High(X)}
JA @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
JB @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}
Cmp EAX,[ESP] {Compare Low(Square(M)):Low(X)}
JA @@SqrMIsMoreThenX {If Low(Square(M))>Low(X) then Square(M)>X, jump}
{------------------------}
@@SqrMIsLessThenX: {Square(M)<=X}
Mov EDI,ECX {Set Low 32 Bit of L with Low 32 Bit of M}
Mov ESI,EBX {ESI:EDI <- L= M}
Jmp @@ProcessMid {Go to process the mid value}
{------------------------}
@@SqrMIsMoreThenX: {Square(M)>X}
Mov [ESP+8],ECX {Set Low 32 Bit of H with Low 32 Bit of M}
Mov [ESP+12],EBX {H= M}
{------------------------}
@@ProcessMid: {Process the mid value}
Mov EAX,[ESP+8] {EAX <- Low 32 Bit of H}
Mov EDX,[ESP+12] {EDX:EAX <- H}
Mov ECX,EDI {Set Low 32 Bit of M with Low 32 Bit of L}
Mov EBX,ESI {EBX:ECX <- M= L}
Add ECX,EAX {Set Low 32 Bit of M with Low 32 Bit of (L+H)}
AdC EBX,EDX {EBX:ECX <- M= L+H}
RCR EBX,1 {Set High 32 Bit of M with High 32 Bit of (L+H)/2}
RCR ECX,1 {EBX:ECX <- M= (L+H)/2}
{------------------------}
Sub EAX,EDI {EAX <- Low 32 Bit of (H-L)}
SbB EDX,ESI {EDX:EAX <- H-L}
Cmp EDX,0 {If High 32 Bit of (H-L) aren't zero: (H-L) >= 2 ...}
JNE @@LoopBegin {... Repeat @LoopBegin else goes forward}
Cmp EAX,2 {If (H-L) >= 2 ...}
JAE @@LoopBegin {... Repeat @LoopBegin else goes forward}
{------------------------}
Add ESP,16 {Remove 4 local 32 bit variables from stack}
{------------------------}
Pop EDX {Load from stack the output var. address into EDX reg.}
Mov [EDX],ECX {Set Low 32 Bit of output var. with Low 32 Bit of Sqrt(X)}
Mov [EDX+4],EBX {Y= Sqrt(X)}
{------------------------}
Pop EDI {Retrieve EDI register from the stack}
Pop ESI {Retrieve ESI register from the stack}
Pop EBX {Retrieve EBX register from the stack}
End;
This my 8086+ CPU real-mode routine works through half-section algorithm and supports the square-root of a 32 Bit unsigned-integer number; is not too fast, but may be optimized:
Procedure _PosLongIMul2; Assembler;
{INPUT:
DX:AX-> First factor (destroyed).
BX:CX-> Second factor (destroyed).
OUTPUT:
BX:CX:DX:AX-> Multiplication result.
TEMP:
BP, Di, Si}
Asm
Jmp @Go
@VR:DD 0 {COPY of RESULT (LOW)}
DD 0 {COPY of RESULT (HIGH)}
@Go:Push BP
Mov BP,20H {32 Bit Op.}
XOr DI,DI {COPY of first op. (LOW)}
XOr SI,SI {COPY of first op. (HIGH)}
Mov [CS:OffSet @VR ],Word(0)
Mov [CS:OffSet @VR+2],Word(0)
Mov [CS:OffSet @VR+4],Word(0)
Mov [CS:OffSet @VR+6],Word(0)
@01:ShR BX,1
RCR CX,1
JAE @00
Add [CS:OffSet @VR ],AX
AdC [CS:OffSet @VR+2],DX
AdC [CS:OffSet @VR+4],DI
AdC [CS:OffSet @VR+6],SI
@00:ShL AX,1
RCL DX,1
RCL DI,1
RCL SI,1
Dec BP
JNE @01
Mov AX,[CS:OffSet @VR]
Mov DX,[CS:OffSet @VR+2]
Mov CX,[CS:OffSet @VR+4]
Mov BX,[CS:OffSet @VR+6]
Pop BP
End;
Function _Sqrt:LongInt; Assembler;
{INPUT:
Di:Si-> Unsigned integer input number X.
OUTPUT:
DX:AX-> Square root Y=Sqrt(X).
TEMP:
BX, CX}
Asm
Jmp @Go
@Vr:DD 0 {+0 LOW}
DD 0 {+4 HIGH}
DD 0 {+8 Mid}
DB 0 {+12 COUNT}
@Go:Mov [CS:OffSet @Vr],Word(0)
Mov [CS:OffSet @Vr+2],Word(0)
Mov [CS:OffSet @Vr+4],SI
Mov [CS:OffSet @Vr+6],DI
Mov [CS:OffSet @Vr+8],Word(0)
Mov [CS:OffSet @Vr+10],Word(0)
Mov [CS:OffSet @Vr+12],Byte(32)
@02:Mov AX,[CS:OffSet @Vr]
Mov DX,[CS:OffSet @Vr+2]
Mov CX,[CS:OffSet @Vr+4]
Mov BX,[CS:OffSet @Vr+6]
Add AX,CX
AdC DX,BX
ShR DX,1
RCR AX,1
Mov [CS:OffSet @Vr+8],AX
Mov [CS:OffSet @Vr+10],DX
Mov CX,AX
Mov BX,DX
Push DI
Push SI
Call _PosLongIMul2
Pop SI
Pop DI
Or BX,CX
JNE @00
Cmp DX,DI
JA @00
JB @01
Cmp AX,SI
JA @00
@01:Mov AX,[CS:OffSet @Vr+8]
Mov [CS:OffSet @Vr],AX
Mov DX,[CS:OffSet @Vr+10]
Mov [CS:OffSet @Vr+2],DX
Dec Byte Ptr [CS:OffSet @Vr+12]
JNE @02
JE @03
@00:Mov AX,[CS:OffSet @Vr+8]
Mov [CS:OffSet @Vr+4],AX
Mov DX,[CS:OffSet @Vr+10]
Mov [CS:OffSet @Vr+6],DX
Dec Byte Ptr [CS:OffSet @Vr+12]
JNE @02
@03:Mov AX,[CS:OffSet @Vr+8]
Mov DX,[CS:OffSet @Vr+10]
End;
This my 8086+ CPU real-mode routine returns a 32 Bit fixed-point number that is the square-root of a 16 Bit unsigned-integer input number; is not too fast, but may be optimized:
Function _Sqrt2:LongInt; Assembler;
{INPUT:
Si-> Unsigned integer number X (unaltered).
OUTPUT:
AX-> Integer part of Sqrt(X).
DX-> Decimal part of Sqrt(X).
TEMP:
BX, CX, DX, Di}
Asm
Jmp @Go
@Vr:DD 0 {+0 LOW}
DD 0 {+4 HIGH}
DD 0 {+8 Mid}
DB 0 {+12 COUNT}
@Go:Mov [CS:OffSet @Vr],Word(0)
Mov [CS:OffSet @Vr+2],Word(0)
Mov [CS:OffSet @Vr+4],Word(0)
Mov [CS:OffSet @Vr+6],SI
Mov [CS:OffSet @Vr+8],Word(0)
Mov [CS:OffSet @Vr+10],Word(0)
Mov [CS:OffSet @Vr+12],Byte(32)
@02:Mov AX,[CS:OffSet @Vr]
Mov DX,[CS:OffSet @Vr+2]
Mov CX,[CS:OffSet @Vr+4]
Mov BX,[CS:OffSet @Vr+6]
Add AX,CX
AdC DX,BX
ShR DX,1
RCR AX,1
Mov [CS:OffSet @Vr+8],AX
Mov [CS:OffSet @Vr+10],DX
Mov CX,AX
Mov BX,DX
Push SI
Call _PosLongIMul2
Pop SI
Or BX,BX
JNE @00
Cmp CX,SI
JB @01
JA @00
Or DX,AX
JNE @00
@01:Mov AX,[CS:OffSet @Vr+8]
Mov [CS:OffSet @Vr],AX
Mov DX,[CS:OffSet @Vr+10]
Mov [CS:OffSet @Vr+2],DX
Dec Byte Ptr [CS:OffSet @Vr+12]
JNE @02
JE @03
@00:Mov AX,[CS:OffSet @Vr+8]
Mov [CS:OffSet @Vr+4],AX
Mov DX,[CS:OffSet @Vr+10]
Mov [CS:OffSet @Vr+6],DX
Dec Byte Ptr [CS:OffSet @Vr+12]
JNE @02
@03:Mov DX,[CS:OffSet @Vr+8]
Mov AX,[CS:OffSet @Vr+10]
End;
Hi!
Here is my original 64-bit Square root routine:
// X:int64
// sqrt64(X)
....
asm
push ESI {preserve ESI,EDI and EBX}
push EDI
push EBX
mov ESI,dword ptr X
mov EDI,dword ptr X+4
xor Eax,Eax
xor Ebx,Ebx
mov cx,32
@next:
// Add ESI,ESI //old RCL ESI,1 - Peter Cordes suggestion
// ADC EDI,EDI //old RCL EDI,1 ~1.38x faster!
// ADC EBX,EBX //old RCL Ebx,1
//Add ESI,ESI //old RCL ESI,1
//ADC EDI,EDI //old RCL EDI,1
//ADC EBX,EBX //old RCL Ebx,1
shld ebx, edi, 2 //- Peter Cordes 41% speed up!
shld edi, esi, 2
lea esi, [esi*4]
//mov EDX,EAX
//Add EDX,EDX //old shl Edx,1
//stc
//ADC EDX,EDX //old RCL Edx,1 {+01}
lea edx, [eax*4 + 1] //- Peter Cordes +20% speed up
cmp EBX,EDX {if BX>=DX -> BX-DX}
JC @skip
sub EBX,EDX
@skip:
cmc {invert C}
ADC EAX,EAX //old RCL Eax,1
dec cx
jnz @next // - Peter Cordes +40% speed up
//LOOP @next
pop EBX
pop EDI
pop ESI
mov result,Eax
end;
....
You mean other than by using the FSQRT instruction? Sure, it's floating-point, but it should be easy enough to do the conversion. Otherwise you have to use something like Newton's Method.
For a naive prime-testing loop condition, you can use the quotient from trial division to find out when your divisor has passed the sqrt: if (x / d > d) break;
Checking if a number is prime in NASM Win64 Assembly/ has more detail, including an implementation of that in x86 asm. You're doing that division anyway to check for remainder == 0
, so you get the quotient for free.
In 64-bit code, normal integer division can be 64-bit. (But 64-bit operand-size for div
is a lot slower than 32-bit on Intel so use 32-bit when possible.)