Cortex A9 NEON vs VFP usage confusion

日久生厌 2021-01-30 23:22

I'm trying to build a library for a Cortex A9 ARM processor (an OMAP4, to be more specific) and I'm a little confused about which/when to use NEON vs VFP in the

3 answers
  •  北恋 (original poster)
    2021-01-31 00:08

    ... forum and blog posts and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea.

    I'm not sure this is correct. According to ARM's Introducing NEON Development Article, in the NEON registers section:

    The NEON register bank consists of 32 64-bit registers. If both Advanced SIMD and VFPv3 are implemented, they share this register bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that supports 32 double-precision floating-point registers. This integration simplifies implementing context switching support, because the same routines that save and restore VFP context also save and restore NEON context.

    The NEON unit can view the same register bank as:

    • sixteen 128-bit quadword registers, Q0-Q15
    • thirty-two 64-bit doubleword registers, D0-D31.

    The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers and each of the Q0-Q15 registers map onto a pair of D registers. Figure 1.3 shows the different views of the shared NEON and VFP register bank. All of these views are accessible at any time. Software does not have to explicitly switch between them, because the instruction used determines the appropriate view.

    The registers don't compete; rather, they co-exist as views of the same register bank. There's no way to separate the NEON and FPU hardware.
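
    To make the aliasing concrete, here is a small sketch of the Q-to-D mapping: each Qn overlays D(2n) as its low half and D(2n+1) as its high half. The function name q_to_d is made up for illustration; the pairing itself follows the ARMv7 register layout quoted above.

```shell
# Print the two D registers that alias a given Q register.
# Qn overlays D(2n) (low 64 bits) and D(2n+1) (high 64 bits).
# q_to_d is a hypothetical helper name, not part of any toolchain.
q_to_d() {
    case $1 in
        ''|*[!0-9]*) echo "usage: q_to_d <0-15>" >&2; return 1 ;;
    esac
    if [ "$1" -gt 15 ]; then
        echo "Q$1 does not exist (only Q0-Q15)" >&2
        return 1
    fi
    echo "Q$1 = D$((2 * $1)) (low) + D$((2 * $1 + 1)) (high)"
}

q_to_d 0     # Q0 = D0 (low) + D1 (high)
q_to_d 15    # Q15 = D30 (low) + D31 (high)
```

    So a VFP instruction that writes D6 also changes the low half of Q3 as seen by NEON; nothing has to be copied between the two units.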


    Related to this I'm using the following compilation flags:

    -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
    -O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
    

    Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and compiler.

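    gnueabihf tells me the platform uses hard floats, which can speed up procedure calls. If in doubt, use softfp: it still emits FPU instructions but keeps the soft-float calling convention, so it links with soft-float code. It is not link-compatible with hard-float libraries.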
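
    The triplet check can be scripted. This is only a sketch: float_abi_from_triplet is a name I made up, and it goes by the usual GNU naming convention (an "hf" suffix on the EABI part means hard float), which a custom toolchain may not follow.

```shell
# Guess the float ABI from a GNU target triplet such as the one
# printed by `gcc -v` or `gcc -dumpmachine`.
# float_abi_from_triplet is a hypothetical helper name.
float_abi_from_triplet() {
    case $1 in
        aarch64*)     echo "n/a" ;;            # AArch64 GCC has no -mfloat-abi
        arm*-*eabihf) echo "hard" ;;           # e.g. arm-linux-gnueabihf
        arm*-*eabi)   echo "soft or softfp" ;; # e.g. arm-linux-gnueabi
        *)            echo "unknown" ;;
    esac
}

float_abi_from_triplet arm-linux-gnueabihf
float_abi_from_triplet aarch64-linux-gnu
# On a real board: float_abi_from_triplet "$(gcc -dumpmachine)"
```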

    BeagleBone Black:

    $ gcc -v 2>&1 | grep Target          
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    model name  : ARMv7 Processor rev 2 (v7l)
    Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
    ...
    

    So the BeagleBone uses:

    -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
    

    CubieTruck v5:

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    Processor   : ARMv7 Processor rev 5 (v7l)
    Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 
    

    So the CubieTruck uses:

    -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
    

    Banana Pi Pro:

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    Processor   : ARMv7 Processor rev 4 (v7l)
    Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
    

    So the Banana Pi uses:

    -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
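
    The three ARMv7 recipes above follow the same pattern, so the choice of -mfpu can be sketched as a function of the Features line. The helper names are made up, and the rules are only my reading of the boards above: "neon" plus "vfpv4" gives neon-vfpv4, "neon" alone gives neon, and without NEON it falls back to a plain VFP variant (using the conservative -d16 form unless "vfpd32" is listed).

```shell
# Succeed if word $1 appears in the space-separated list $2.
has_feature() {
    for f in $2; do
        [ "$f" = "$1" ] && return 0
    done
    return 1
}

# Suggest a GCC -mfpu value from a 32-bit ARM "Features" line.
# mfpu_from_features is a hypothetical helper, not a real tool.
mfpu_from_features() {
    if has_feature neon "$1"; then
        if has_feature vfpv4 "$1"; then echo "neon-vfpv4"; else echo "neon"; fi
        return 0
    fi
    # No NEON: pick a VFP-only unit. Older kernels may not report vfpd32,
    # so assume 16 D registers unless told otherwise.
    if has_feature vfpd32 "$1"; then suffix=""; else suffix="-d16"; fi
    if has_feature vfpv4 "$1"; then
        echo "vfpv4$suffix"
    elif has_feature vfpv3 "$1"; then
        echo "vfpv3$suffix"
    elif has_feature vfp "$1"; then
        echo "vfp"
    else
        echo "none"
        return 1
    fi
}

# BeagleBone Black line from above:
mfpu_from_features "half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32"
# On a real board:
#   mfpu_from_features "$(grep -m1 '^Features' /proc/cpuinfo | cut -d: -f2)"
```

    This does not tell you when neon-fp-armv8 applies; an ARMv8 core running a 32-bit kernel reports the same neon and vfpv4 words.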
    

    Raspberry Pi 3:

    The RPI3 is unique in that it's ARMv8 but runs a 32-bit OS, which means it's effectively 32-bit ARM, or AArch32. There's a little more to 32-bit ARM vs AArch32, but this will show you the AArch32 flags.

    Also, the RPI3 uses a Broadcom SoC with Cortex-A53 cores; it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo 
    model name  : ARMv7 Processor rev 4 (v7l)
    Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
    ...
    

    So the Raspberry Pi can use:

    -march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
    

    Or it can use (I don't know what to use for -mtune):

    -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard 
    

    ODROID C2:

    ODROID C2 uses an Amlogic SoC with Cortex-A53 cores, but it runs a 64-bit OS. It has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (a similar configuration to the RPI3).

    $ gcc -v 2>&1 | grep Target 
    Target: aarch64-linux-gnu
    
    $ cat /proc/cpuinfo 
    Features    : fp asimd evtstrm crc32
    

    So the ODROID uses:

    -march=armv8-a+crc -mtune=cortex-a53
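
    On a 64-bit OS the optional extensions show up as separate words in the Features line, so the -march suffixes can be derived from it. A sketch, with a made-up function name; it assumes the kernel reports "crc32" for CRC32 and "aes pmull sha1 sha2" for the Crypto extensions, and only adds +crypto when all four are present.

```shell
# Build an AArch64 -march value from a /proc/cpuinfo "Features" line.
# aarch64_march_from_features is a hypothetical helper name.
aarch64_march_from_features() {
    march="armv8-a"
    case " $1 " in
        *" crc32 "*) march="$march+crc" ;;
    esac
    # Require every crypto-related word before adding +crypto.
    crypto=yes
    for want in aes pmull sha1 sha2; do
        case " $1 " in
            *" $want "*) ;;
            *) crypto=no ;;
        esac
    done
    [ "$crypto" = yes ] && march="$march+crypto"
    echo "$march"
}

# ODROID C2 line from above:
aarch64_march_from_features "fp asimd evtstrm crc32"
```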
    

    In the above recipes, I identified the ARM core (like Cortex A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers the output of /proc/cpuinfo:

    CPU part: Part number. 0xd03 indicates Cortex-A53 processor.

    So we may be able to look up the value in a database. I don't know if one exists or where it's located.
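
    Short of a real database, a hand-written table covers the common cores. The sketch below is partial and the function name is made up; the part numbers are the primary part numbers ARM documents for its own cores, so they only apply when "CPU implementer" is 0x41 (ARM).

```shell
# Map the "CPU part" value from /proc/cpuinfo to a GCC -mtune name.
# Partial table; only valid when "CPU implementer" is 0x41 (ARM).
# arm_part_to_mtune is a hypothetical helper name.
arm_part_to_mtune() {
    case $1 in
        0xc07) echo "cortex-a7" ;;
        0xc08) echo "cortex-a8" ;;
        0xc09) echo "cortex-a9" ;;
        0xc0f) echo "cortex-a15" ;;
        0xd03) echo "cortex-a53" ;;
        0xd07) echo "cortex-a57" ;;
        0xd08) echo "cortex-a72" ;;
        *)     echo "unknown"; return 1 ;;
    esac
}

arm_part_to_mtune 0xd03
# On a real board:
#   arm_part_to_mtune "$(awk -F': ' '/^CPU part/ {print $2; exit}' /proc/cpuinfo)"
```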
