Question
I am implementing a function in x86-64 assembly whose prototype I cannot alter:
unsigned long rotate(unsigned long val, unsigned long num, unsigned long direction);
direction: 1 is left and 0 is right.
This is my code to shift left, but it's not working: the last bit is off. Can someone help me, please?
rotate:
push rbp
push rdi
push rsi
push rdx
mov rbp, rsp
sub rsp, 16
cmp rdx, 1
je shift_left
shift_left:
mov rax, rdi
shl rax, cl
mov rax, rax
mov rcx, rdi
sub cl, 64
shl rcx, cl
or rax, rdx
mov rax, rax
add rsp, 16
#I pop all the registers used and ret
Answer 1:
x86 has rotate instructions. Use rol rax, cl to rotate left, and ror rax, cl
to rotate right.
It seems you didn't realize that cl is the low byte of rcx / ecx. Thus shl rcx, cl is shifting the shift-count. Your function is over-complicated, but that's normal when you're just learning. It takes practice to find the simple underlying problem that you can implement in a few instructions.
Also, I think mov rcx, rdi was supposed to be mov rcx, rsi. IDK what mov rax, rax was supposed to be; it's just a no-op.
It would be significantly more efficient to call different functions for rotate-left vs. rotate-right, unless you actually need direction to be a runtime variable that isn't just a build-time constant 1 or 0.
Or to make it branchless, conditionally do cl = 64-cl, because a left-rotate by n is the same thing as a right-rotate by 64-n. And because rotate instructions mask the count (and rotate is modular anyway), you can actually just do -n instead of 64-n. (See Best practices for circular shift (rotate) operations in C++ for some C that uses -n instead of 32-n, and compiles to a single rotate instruction.)
TL:DR Because of rotate symmetry, you can rotate in the other direction just by negating the count. As @njuffa points out, you could have written the function with a signed shift count where negative means rotate the other way, so the caller would pass you num or -num in the first place.
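For example, a stand-alone rotate-right built only out of rol and a negated count might look like this (my sketch, not code from the original answer; count in esi, val in rdi as above):

    ; untested sketch: rotate-right by esi using only rol
    mov   rax, rdi        ; val
    mov   ecx, esi
    neg   ecx             ; -num
    rol   rax, cl         ; rol by (-num & 63) == ror by num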
Note that in your code, sub cl, 64 has no effect on the shift count of the next shl, because 64-bit shl already masks the count with cl & 63.
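A small illustration of that masking (my example, not from the original answer):

    ; with a count of 5, sub cl, 64 changes cl but not what shl sees
    mov   cl, 5
    sub   cl, 64          ; cl = -59 = 0xC5
    shl   rax, cl         ; shifts by 0xC5 & 63 = 5, same as without the sub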
I made a C version to see what compilers would do (on the Godbolt compiler explorer). gcc has an interesting idea: rotate both ways and use a cmov to pick the right result. This kinda sucks because variable-count shifts/rotates are 3 uops on Intel SnB-family CPUs. (Because they have to leave the flags unmodified if the count turns out to be 0. See the shift section of this answer; all of it applies to rotates as well.)
Unfortunately BMI2 only added an immediate-count version of rorx, and variable-count shlx / shrx, not variable-count no-flags rotate.
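For reference, the immediate-count form looks like this (my example, not from the original answer):

    rorx  rax, rdi, 13    ; BMI2: rax = rdi rotated right by 13, flags untouched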
Anyway, based on those ideas, here's a good way to implement your function for the x86-64 System V ABI / calling convention (where functions are allowed to clobber the input-arg registers and r10 / r11). I assume you're on a platform that uses the x86-64 SysV ABI (like Linux or OS X) because you appear to be using rdi, rsi, and rdx for the first 3 args (or at least trying to), and your long is 64 bits.
;; untested
;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
xor ecx, ecx
sub ecx, esi ; -num
test edx, edx
mov rax, rdi ; put val in the retval register
cmovnz ecx, esi ; cl = direction ? num : -num
rol rax, cl ; works as a rotate-right by 64-num if direction is 0
ret
xor-zero / sub is often better than mov / neg because the xor-zeroing is off the critical path. mov / neg is better on Ryzen, though, which has zero-latency integer mov and still needs an ALU uop to do xor-zeroing. But if ALU uops aren't your bottleneck, this is still fine. It's a clear win on Intel Sandybridge (where xor-zeroing is as cheap as a NOP), and also a latency win on other CPUs that don't have zero-latency mov (like Silvermont/KNL, or AMD Bulldozer-family).
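For comparison, the mov / neg variant of the count setup would be (my untested sketch with the same behaviour as the version above; the rotate_movneg label is just for illustration):

    ;; untested sketch: mov / neg instead of xor-zero / sub
    rotate_movneg:
    mov    ecx, esi       ; copy num (zero-latency mov on Ryzen)
    neg    ecx            ; ecx = -num, on the critical path from num
    test   edx, edx
    mov    rax, rdi       ; put val in the retval register
    cmovnz ecx, esi       ; cl = direction ? num : -num
    rol    rax, cl
    ret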
cmov is 2 uops on Intel pre-Broadwell. A 2's complement bithack alternative to xor/sub/test/cmov might be just as good if not better: -num = ~num + 1.
rotate:
dec edx ; convert direction = 0 / 1 into -1 / 0
mov ecx, esi ; couldn't figure out how to avoid this with lea ecx, [rdx-1] or something
xor ecx, edx ; (direction==0) ? ~num : num ; NOT = xor with all-ones
sub ecx, edx ; (direction==0) ? ~num + 1 : num + 0;
; conditional negation using -num = ~num + 1. (subtracting -1 is the same as adding 1)
mov rax, rdi ; put val in the retval register
rol rax, cl ; works as a rotate-right by 64-num if direction is 0
ret
This would have more of an advantage if inlined so num could already be in ecx, making this shorter than the other options (in code-size and uop count).
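Hypothetically, if this were inlined with num already in ecx, val in rax, and direction in edx, the whole thing could shrink to something like (my sketch, not from the original answer):

    dec   edx             ; direction = 0 / 1  ->  -1 / 0
    xor   ecx, edx        ; ~num if direction was 0, else num unchanged
    sub   ecx, edx        ; +1 completes the negation when direction was 0
    rol   rax, cl         ; rotate val in place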
Latency on Haswell:
- From direction being ready to cl being ready for rol: 3 cycles (dec / xor / sub). Same as test / cmov in the other version. (But on Broadwell/Skylake, test / cmov only has 2 cycle latency from direction to cl.)
- From num being ready to cl being ready: 2 cycles: mov (0) + xor (1) + sub (1), so there's room for num to be ready 1 cycle later. This is better than with cmov on Haswell, where it's sub (1) + cmov (2) = 3 cycles. But on Broadwell/Skylake, it's only 2c either way.
The total front-end uop count is better on pre-Broadwell, because we avoid cmov. We traded an xor-zeroing for a mov, which is worse on Sandybridge, but about equal everywhere else. (Except that it's on the critical path for num, which matters for CPUs without zero-latency mov.)
BTW, a branching implementation could actually be faster if the branch on direction is very predictable. But usually that means it would have been better to just inline a rol or ror instruction.
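Such a branching version might look like this (my untested sketch, same arg registers as above; the rotate_branchy label is just for illustration):

    ;; untested sketch: branch on direction instead of going branchless
    rotate_branchy:
    mov   ecx, esi
    mov   rax, rdi
    test  edx, edx
    jz    .rot_right
    rol   rax, cl         ; direction != 0: rotate left
    ret
    .rot_right:
    ror   rax, cl         ; direction == 0: rotate right
    ret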
Or this one: gcc's output with the redundant and ecx, 63 removed. It should be pretty good on some CPUs, but doesn't have much advantage compared to the above. (And is clearly worse on mainstream Intel Sandybridge-family CPUs, including Skylake.)
;; not good on Intel SnB-family
;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
mov ecx, esi
mov rax, rdi
rol rax, cl ; 3 uops
ror rdi, cl ; false-dependency on flags on Intel SnB-family
test edx, edx ; look at the low 32 bits for 0 / non-0
cmovz rax, rdi ; direction=0 means use the rotate-right result
ret
The false dependency is only for the flag-setting uops; I think the rdi result of ror rdi, cl is independent of the flag-merge uop of the preceding rol rax, cl. (See SHL/SHR r,cl latency is lower than throughput.) But all the uops require p0 or p6, so there will be resource conflicts that limit instruction-level parallelism.
Using rotate(unsigned long val, int left_count)
Caller passes you a signed rotate count in esi. Or call it rsi if you want; you ignore all but the low 6 bits of it. You actually just do a left-rotate in the range [0, 63], but that's the same as supporting left and right rotates in the range [-63, +63]. (With larger values wrapping into that range.)
e.g. an arg of -32 is 0xffffffe0, which masks down to 0x20, which is 32. Rotating by 32 in either direction is the same operation.
rotate:
mov rax, rdi
mov ecx, esi
rol rax, cl
ret
The only way this could be any more efficient is inlining into the caller to avoid the mov and call / ret instructions. (Or for constant-count rotates, using an immediate rotate count, which makes it a single-uop instruction on Intel CPUs.)
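e.g. a caller that always rotates left by 13 could, after inlining, end up with just (my example):

    rol   rax, 13         ; single-uop immediate-count rotate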
Source: https://stackoverflow.com/questions/47396960/how-do-i-rotate-a-value-in-assembly