Cortex A9 NEON vs VFP usage confusion

日久生厌 2021-01-30 23:22

I'm trying to build a library for a Cortex A9 ARM processor (an OMAP4, to be more specific) and I'm a little confused about which/when to use NEON vs VFP in the

3 Answers
  • 2021-01-30 23:56

    I think this question should be split up into several, adding some code examples and detailing target platform and versions of toolchains used.

    But to cover one part of the confusion: the recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine; the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.

    -mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.

    See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.

  • 2021-01-31 00:07

    I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in optimizing for it.

    It might sound rude, but I really don't see any point in NEON intrinsics either. It's more trouble than help - if any.

    Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.

    Then you can start writing native NEON code without worrying about the compiler doing something strange and spitting out tons of errors/warnings.

    Learning the NEON instructions is less demanding than learning all those intrinsic macros. On top of that, the results are so much better.

    In my experience, fully optimized native NEON code usually runs more than twice as fast as a well-written intrinsics counterpart.

    Just compare the OP's version with mine in the link below, you'll then know what I mean.

    Optimizing RGBA8888 to RGB565 conversion with NEON

    regards

  • 2021-01-31 00:08

    ... forum and blog posts and everybody seems to agree that using NEON is better than using VFP or at least mixing NEON (e.g. using the intrinsics to implement some algos in SIMD) and VFP is not such a good idea

    I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:

    The NEON register bank consists of 32 64-bit registers. If both Advanced SIMD and VFPv3 are implemented, they share this register bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that supports 32 double-precision floating-point registers. This integration simplifies implementing context switching support, because the same routines that save and restore VFP context also save and restore NEON context.

    The NEON unit can view the same register bank as:

    • sixteen 128-bit quadword registers, Q0-Q15
    • thirty-two 64-bit doubleword registers, D0-D31.

    The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers and each of the Q0-Q15 registers map onto a pair of D registers. Figure 1.3 shows the different views of the shared NEON and VFP register bank. All of these views are accessible at any time. Software does not have to explicitly switch between them, because the instruction used determines the appropriate view.

    The registers don't compete; rather, they co-exist as views of the same register bank. There's no way to separate the NEON and FPU gear.


    Related to this I'm using the following compilation flags:

    -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
    -O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
    

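    Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and compiler.

    gnueabihf tells me the platform uses hard floats, which can speed up procedure calls because floating-point arguments are passed in VFP registers. If in doubt about the ABI, softfp is the conservative choice: it still emits hardware floating-point instructions but keeps the soft-float calling convention, so it links against soft-float code (but not against hard-float libraries).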

    BeagleBone Black:

    $ gcc -v 2>&1 | grep Target          
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    model name  : ARMv7 Processor rev 2 (v7l)
    Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
    ...
    

    So the BeagleBone uses:

    -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
    

    CubieTruck v5:

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    Processor   : ARMv7 Processor rev 5 (v7l)
    Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 
    

    So the CubieTruck uses:

    -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
    

    Banana Pi Pro:

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    Processor   : ARMv7 Processor rev 4 (v7l)
    Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
    

    So the Banana Pi uses:

    -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
    

    Raspberry Pi 3:

    The RPI3 is unique in that it's ARMv8, but it's running a 32-bit OS. That means it's effectively 32-bit ARM or AArch32. There's a little more to 32-bit ARM vs AArch32, but this will show you the AArch32 flags.

    Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo 
    model name  : ARMv7 Processor rev 4 (v7l)
    Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
    ...
    

    So the Raspberry Pi can use:

    -march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
    

    Or it can use (I don't know what to use for -mtune):

    -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard 
    

    ODROID C2:

    The ODROID C2 uses an Amlogic A53 SoC, but it runs a 64-bit OS. It has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (similar config to the RPI3).

    $ gcc -v 2>&1 | grep Target 
    Target: aarch64-linux-gnu
    
    $ cat /proc/cpuinfo 
    Features    : fp asimd evtstrm crc32
    

    So the ODROID uses:

    -march=armv8-a+crc -mtune=cortex-a53
    

    In the above recipes, I identified the ARM core (like Cortex A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers output from /proc/cpuinfo:

    CPU part: Part number. 0xd03 indicates Cortex-A53 processor.

    So we may be able to look up the value from a database. I don't know if one exists or where it's located.
