Cortex A9 NEON vs VFP usage confusion

£可爱£侵袭症+ 提交于 2019-12-02 16:21:27
unixsmurf

I think this question should be split up into several, adding some code examples and detailing target platform and versions of toolchains used.

But to cover one part of confusion: The recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine, the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.

-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.

See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.

jww

... forum and blog posts and everybody seems to agree that using NEON is better than using VFP or at least mixing NEON(e.g. using the instrinsics to implement some algos in SIMD) and VFP is not such a good idea

I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:

The NEON register bank consists of 32 64-bit registers. If both Advanced SIMD and VFPv3 are implemented, they share this register bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that supports 32 double-precision floating-point registers. This integration simplifies implementing context switching support, because the same routines that save and restore VFP context also save and restore NEON context.

The NEON unit can view the same register bank as:

  • sixteen 128-bit quadword registers, Q0-Q15
  • thirty-two 64-bit doubleword registers, D0-D31.

The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers and each of the Q0-Q15 registers map onto a pair of D registers. Figure 1.3 shows the different views of the shared NEON and VFP register bank. All of these views are accessible at any time. Software does not have to explicitly switch between them, because the instruction used determines the appropriate view.

The registers don't compete; rather, they co-exist as views of the register bank. There's no way to disgorge the NEON and FPU gear.


Related to this I'm using the following compilation flags:

-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp

Here's what I do; your mileage may vary. Its derived from a mashup of information gathered from the platform and compiler.

gnueabihf tells me the platform use hard floats, which can speed up procedural calls. If in doubt, use softfp because its compatible with hard floats.

BeagleBone Black:

$ gcc -v 2>&1 | grep Target          
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
model name  : ARMv7 Processor rev 2 (v7l)
Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
...

So the BeagleBone uses:

-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard

CubieTruck v5:

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 5 (v7l)
Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 

So the CubieTruck uses:

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Banana Pi Pro:

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 4 (v7l)
Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt

So the Banana Pi uses:

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Raspberry Pi 3:

The RPI3 is unique in that its ARMv8, but its running a 32-bit OS. That means its effectively 32-bit ARM or Aarch32. There's a little more to 32-bit ARM vs Aarch32, but this will show you the Aarch32 flags

Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo 
model name  : ARMv7 Processor rev 4 (v7l)
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...

So the Raspberry Pi can use:

-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard

Or it can use (I don't know what to use for -mtune):

-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard 

ODROID C2:

ODROID C2 uses an Amlogic A53 SoC, but it uses a 64-bit OS. The ODROID C2, it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (similar config to RPI3).

$ gcc -v 2>&1 | grep Target 
Target: aarch64-linux-gnu

$ cat /proc/cpuinfo 
Features    : fp asimd evtstrm crc32

So the ODROID uses:

-march=armv8-a+crc -mtune=cortex-a53

In the above recipes, I learned the ARM processor (like Cortex A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers output from /proc/cpuinfo:

CPU part: Part number. 0xd03 indicates Cortex-A53 processor.

So we may be able to lookup the value form a database. I don't know if it exists or where its located.

Jake 'Alquimista' LEE

I'd stay away from VFP. It's just like the Thmub mode : It's meant to be for compilers. There's no point in optimizing for them.

It might sound rude, but I really don't see any point in NEON intrinsics either. It's more trouble than help - if any.

Just invest two or three days in basic ARM assembly: you only need to learn few instructions for loop control/termination.

Then you can start writing native NEON codes without worrying about the compiler doing something astral spitting out tons of errors/warnings.

Learning NEON instructions is less demanding than all those intrinsics macros. And all above this, the results are so much better.

Fully optimized NEON native codes usually run more than twice as fast than well-written intrinsics counterparts.

Just compare the OP's version with mine in the link below, you'll then know what I mean.

Optimizing RGBA8888 to RGB565 conversion with NEON

regards

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!