Question
For adding two integers, I write:
int sum;
asm volatile("add %0, x3, x4" : "=r"(sum) : :);
How can I do this with two floats? I tried:
float sum;
asm volatile("fadd %0, s3, s4" : "=r"(sum) : :);
But it gives me an error:
Error: operand 1 should be a SIMD vector register -- `fadd x0,s3,s4'
Any ideas?
Answer 1:
ARMv7 double: the %P modifier
GCC developers pointed me to the correct, undocumented modifier for ARMv7 doubles at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89482#c4. Maybe I should stop being lazy and grep the GCC source some day:
main.c
#include <assert.h>

int main(void) {
    double my_double = 1.5;
    __asm__ (
        "vmov.f64 d0, 1.0;"
        "vadd.f64 %P[my_double], %P[my_double], d0;"
        : [my_double] "+w" (my_double)
        :
        : "d0"
    );
    assert(my_double == 2.5);
}
Compile and run:
sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
arm-linux-gnueabihf-gcc -O3 -std=c99 -ggdb3 -march=armv7-a -marm \
-pedantic -Wall -Wextra -o main.out main.c
qemu-arm -L /usr/arm-linux-gnueabihf main.out
Disassembly contains:
0x00010320 <+4>: 08 7b b7 ee vmov.f64 d7, #120 ; 0x3fc00000 1.5
0x00010324 <+8>: 00 0b b7 ee vmov.f64 d0, #112 ; 0x3f800000 1.0
0x00010328 <+12>: 00 7b 37 ee vadd.f64 d7, d7, d0
Tested in Ubuntu 16.04, GCC 5.4.0, QEMU 2.5.0.
Where the operand modifiers are defined in the GCC source:
- ARM: https://github.com/gcc-mirror/gcc/blob/gcc-8_2_0-release/gcc/config/arm/arm.c#L22466
- aarch64: https://github.com/gcc-mirror/gcc/blob/gcc-8_2_0-release/gcc/config/aarch64/aarch64.c#L6743
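The example above adds a hard-coded constant held in d0. As a minimal sketch of the same %P modifier with compiler-chosen input registers, the function below passes both addends as "w" inputs. The names add_doubles and add_doubles_demo are hypothetical, and the guard on the ACLE macro __ARM_FP (bit 0x8 meaning double-precision hardware) is an assumption added here so that the file also builds, via a plain C fallback, on targets without a VFP double unit.

```c
#include <assert.h>

/* Hypothetical helper: add two doubles. On 32-bit ARM with a
   double-precision FPU the add is done in inline asm; elsewhere it
   falls back to plain C so the sketch builds on any host. */
double add_doubles(double a, double b)
{
    double sum;
#if defined(__arm__) && defined(__ARM_FP) && (__ARM_FP & 0x8)
    /* %P prints each "w" operand as dN instead of sN. */
    __asm__ ("vadd.f64 %P[sum], %P[a], %P[b]"
             : [sum] "=w" (sum)
             : [a] "w" (a), [b] "w" (b));
#else
    sum = a + b;  /* assumption: portable fallback, not part of the answer */
#endif
    return sum;
}

/* Usage check mirroring the assert in main.c above. */
void add_doubles_demo(void)
{
    assert(add_doubles(1.5, 1.0) == 2.5);
}
```

Because the compiler allocates all three registers, no clobber list is needed here.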
Answer 2:
Because registers can have multiple names in AArch64 (v0, b0, h0, s0, d0 all refer to the same register), you need an operand modifier in the asm template to select which name is printed (%s0 for single precision, %d0 for double precision):
On Godbolt
float foo()
{
    float sum;
    asm volatile("fadd %s0, s3, s4" : "=w"(sum) : :);
    return sum;
}

double dsum()
{
    double sum;
    asm volatile("fadd %d0, d3, d4" : "=w"(sum) : :);
    return sum;
}
Will produce:
foo:
fadd s0, s3, s4 // sum
ret
dsum:
fadd d0, d3, d4 // sum
ret
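Since the modifier is chosen per operand, one instruction can name the same kind of register at two different widths. A small sketch of that, assuming a hypothetical helper called widen: it converts a float to a double with fcvt, printing the output as dN and the input as sN. The non-AArch64 branch is a plain C fallback added only so the file compiles on other hosts.

```c
#include <assert.h>

/* Hypothetical helper: float -> double conversion.
   Both operands use the "w" (FP/SIMD register) constraint; the s and d
   modifiers decide which register name is printed for each one. */
double widen(float x)
{
    double result;
#ifdef __aarch64__
    __asm__ ("fcvt %d0, %s1" : "=w" (result) : "w" (x));
#else
    result = (double)x;  /* assumption: portable fallback for other targets */
#endif
    return result;
}

/* Usage check: 1.5 is exactly representable at both widths. */
void widen_demo(void)
{
    assert(widen(1.5f) == 1.5);
}
```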
Answer 3:
"=r" is the constraint for GP integer registers.
The GCC manual claims that "=w" is the constraint for an FP / SIMD register on AArch64. But if you try that, you get v0, not s0, which won't assemble. I don't know a workaround here; you should probably report on the gcc bugzilla that the constraint documented in the manual doesn't work for scalar FP.
On Godbolt I tried this source:
float foo()
{
    float sum;
#ifdef __aarch64__
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("fadds %0, s3, s4" : "=t"(sum) : :);  // ARM32
#endif
    return sum;
}

double dsum()
{
    double sum;
#ifdef __aarch64__
    asm volatile("fadd %0, d3, d4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("faddd %0, d3, d4" : "=w"(sum) : :);  // ARM32
#endif
    return sum;
}
clang 7.0 (with its built-in assembler) requires the asm to actually be valid. But with gcc we're only compiling to asm, and Godbolt doesn't have a "binary mode" for non-x86 targets.
# AArch64 gcc 8.2 -xc -O3 -fverbose-asm -Wall
# INVALID ASM, errors if you try to actually assemble it.
foo:
fadd v0, s3, s4 // sum
ret
dsum:
fadd v0, d3, d4 // sum
ret
clang produces the same asm, and its built-in assembler errors with:
<source>:5:18: error: invalid operand for instruction
asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);
^
<inline asm>:1:11: note: instantiated into assembly here
fadd v0, s3, s4
^
On 32-bit ARM, "=t" for single works, but "=w" for double (which the manual says you should use for double-precision) also gives you s0 with gcc. It works with clang, though. You have to use -mfloat-abi=hard and a -mcpu= something with an FPU, e.g. -mcpu=cortex-a15
# clang7.0 -xc -O3 -Wall --target=arm -mcpu=cortex-a15 -mfloat-abi=hard
# valid asm for ARM 32
foo:
vadd.f32 s0, s3, s4
bx lr
dsum:
vadd.f64 d0, d3, d4
bx lr
But gcc fails:
# ARM gcc 8.2 -xc -O3 -fverbose-asm -Wall -mfloat-abi=hard -mcpu=cortex-a15
foo:
fadds s0, s3, s4 @ sum
bx lr @
dsum:
faddd s0, d3, d4 @ sum @@@ INVALID
bx lr @
So you can use "=t" for single just fine with gcc, but for double presumably you need a %something0 modifier to print the register name as d0 instead of s0, with a "=w" output.
Obviously these asm statements would only be useful for anything beyond learning the syntax if you add constraints to specify the input operands as well, instead of reading whatever happened to be sitting in s3 and s4.
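As a rough sketch of what that looks like for the single-precision case, the function below gives the asm statement two input operands so the compiler picks the source registers. The name add_floats is hypothetical. The AArch64 branch relies on the s operand modifier; the ARM32 branch uses the "t" constraint with the unified-syntax vadd.f32 mnemonic seen in the clang output above. The check on the ACLE macro __ARM_FP (bit 0x4 meaning single-precision hardware) and the plain C fallback are assumptions added so the file builds on any host.

```c
#include <assert.h>

/* Hypothetical helper: single-precision add with explicit inputs. */
float add_floats(float a, float b)
{
    float sum;
#if defined(__aarch64__)
    /* "w" = FP/SIMD register; the s modifier prints it as sN. */
    __asm__ ("fadd %s0, %s1, %s2" : "=w" (sum) : "w" (a), "w" (b));
#elif defined(__arm__) && defined(__ARM_FP) && (__ARM_FP & 0x4)
    /* "t" = VFP single-precision register (s0-s31). */
    __asm__ ("vadd.f32 %0, %1, %2" : "=t" (sum) : "t" (a), "t" (b));
#else
    sum = a + b;  /* assumption: portable fallback for other targets */
#endif
    return sum;
}

/* Usage check with exactly representable values. */
void add_floats_demo(void)
{
    assert(add_floats(1.5f, 2.0f) == 3.5f);
}
```

With the result depending only on its declared inputs, the statement has no side effects to protect, so volatile is left off here and the compiler is free to drop the asm when the result is unused.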
See also https://stackoverflow.com/tags/inline-assembly/info
Source: https://stackoverflow.com/questions/53960240/armv8-floating-point-output-inline-assembly