ARMv8 floating point output inline assembly

烂漫一生 提交于 2019-12-24 00:25:58

问题


For adding two integers, I write:

int sum;
asm volatile("add %0, x3, x4" : "=r"(sum) : :);

How can I do this with two floats? I tried:

float sum;
asm volatile("fadd %0, s3, s4" : "=r"(sum) : :);

But it gives me an error:

Error: operand 1 should be a SIMD vector register -- `fadd x0,s3,s4'

Any ideas?


回答1:


ARMv7 double: %P modifier

GCC devs informed me the correct undocumented modifier for ARMv7 doubles at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89482#c4 Maybe I should stop being lazy and grep GCC some day:

main.c

#include <assert.h>

int main(void) {
    double my_double = 1.5;
    __asm__ (
        "vmov.f64 d0, 1.0;"
        "vadd.f64 %P[my_double], %P[my_double], d0;"
        : [my_double] "+w" (my_double)
        :
        : "d0"
    );
    assert(my_double == 2.5);
}

Compile and run:

sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
arm-linux-gnueabihf-gcc -O3 -std=c99 -ggdb3 -march=armv7-a -marm \
  -pedantic -Wall -Wextra -o main.out main.c
qemu-arm -L /usr/arm-linux-gnueabihf main.out

Disassembly contains:

   0x00010320 <+4>:     08 7b b7 ee     vmov.f64        d7, #120        ; 0x3fc00000  1.5
   0x00010324 <+8>:     00 0b b7 ee     vmov.f64        d0, #112        ; 0x3f800000  1.0
   0x00010328 <+12>:    00 7b 37 ee     vadd.f64        d7, d7, d0

Tested in Ubuntu 16.04, GCC 5.4.0, QEMU 2.5.0.

Source code definition point

  • ARM: https://github.com/gcc-mirror/gcc/blob/gcc-8_2_0-release/gcc/config/arm/arm.c#L22466
  • aarch64: https://github.com/gcc-mirror/gcc/blob/gcc-8_2_0-release/gcc/config/aarch64/aarch64.c#L6743



回答2:


Because registers can have multiple names in AArch64 (v0, b0, h0, s0, d0 all refer to the same register) it is necessary to add an output modifier to the print string:

On Godbolt

float foo()
{
    float sum;
    asm volatile("fadd %s0, s3, s4" : "=w"(sum) : :);
    return sum;
}

double dsum()
{
    double sum;
    asm volatile("fadd %d0, d3, d4" : "=w"(sum) : :);
    return sum;
}

Will produce:

foo:
        fadd s0, s3, s4 // sum
        ret     
dsum:
        fadd d0, d3, d4 // sum
        ret  



回答3:


"=r" is the constraint for GP integer registers.

The GCC manual claims that "=w" is the constraint for an FP / SIMD register on AArch64. But if you try that, you get v0 not s0, which won't assemble. I don't know a workaround here, you should probably report on the gcc bugzilla that the constraint documented in the manual doesn't work for scalar FP.

On Godbolt I tried this source:

float foo()
{
    float sum;
#ifdef __aarch64__
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("fadds %0, s3, s4" : "=t"(sum) : :);  // ARM32
#endif
    return sum;
}

double dsum()
{
    double sum;
#ifdef __aarch64__
    asm volatile("fadd %0, d3, d4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("faddd %0, d3, d4" : "=w"(sum) : :);  // ARM32
#endif
    return sum;
}

clang7.0 (with its built-in assembler) requires the asm to be actually valid. But for gcc we're only compiling to asm, and Godbolt doesn't have a "binary mode" for non-x86.

# AArch64 gcc 8.2  -xc -O3 -fverbose-asm -Wall
# INVALID ASM, errors if you try to actually assemble it.
foo:
    fadd v0, s3, s4 // sum
    ret     
dsum:
    fadd v0, d3, d4 // sum
    ret

clang produces the same asm, and its built-in assembler errors with:

<source>:5:18: error: invalid operand for instruction
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);
                 ^
<inline asm>:1:11: note: instantiated into assembly here
        fadd v0, s3, s4
             ^

On 32-bit ARM, =t" for single works, but "=w" for (which the manual says you should use for double-precision) also gives you s0 with gcc. It works with clang, though. You have to use -mfloat-abi=hard and a -mcpu= something with an FPU, e.g. -mcpu=cortex-a15

# clang7.0 -xc -O3 -Wall--target=arm -mcpu=cortex-a15 -mfloat-abi=hard
# valid asm for ARM 32
foo:
        vadd.f32        s0, s3, s4
        bx      lr
dsum:
        vadd.f64        d0, d3, d4
        bx      lr

But gcc fails:

# ARM gcc 8.2  -xc -O3 -fverbose-asm -Wall -mfloat-abi=hard -mcpu=cortex-a15
foo:
        fadds s0, s3, s4        @ sum
        bx      lr  @
dsum:
        faddd s0, d3, d4        @ sum    @@@ INVALID
        bx      lr  @

So you can use =t for single just fine with gcc, but for double presumably you need a %something0 modifier to print the register name as d0 instead of s0, with a "=w" output.


Obviously these asm statements would only be useful for anything beyond learning the syntax if you add constraints to specify the input operands as well, instead of reading whatever happened to be sitting in s3 and s4.

See also https://stackoverflow.com/tags/inline-assembly/info



来源:https://stackoverflow.com/questions/53960240/armv8-floating-point-output-inline-assembly

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!