neon

Enable neon on ARM cortex-a series

萝らか妹 submitted on 2019-12-11 14:46:31
Question: I want to initialize the NEON coprocessor on a bare-metal Cortex-A15. Following ARM's directions, I wrote this sequence at the end of my platform init sequence:

    MOV r0, #0x00F00000
    MRC p15, 0, r0, c1, c1, 2
    ORR r0, r0, #0x0C00
    BIC r0, r0, #0xC000
    MCR p15, 0, r0, c1, c1, 2
    ISB
    MRC p15, 4, r0, c1, c1, 2
    BIC r0, r0, #0x0C00
    BIC r0, r0, #(3<<14)
    MCR p15, 4, r0, c1, c1, 2
    ISB
    MOV r3, #0x40000000
    VMSR FPEXC, r3

I get this error:

    Error: operand 0 must be FPSCR -- `vmsr FPEXC,r3'

I am using arm-eabi-as

How to use arm_acle C language extensions in android

橙三吉。 submitted on 2019-12-11 12:40:15
Question: There are lots of examples of using ARM NEON intrinsics on Android, and the NDK even ships one. I've gotten that to work with no problem. Arm also offers the ACLE (Arm C Language Extensions), but I can find next to nothing by way of examples. The Arm document itself merely suggests including the arm_acle.h header file, yet I still get errors. Google has offered almost zero assistance :) Searching the Arm community boards has also yielded little. Do people not use

depth transformation with ARM neon intrinsics

我怕爱的太早我们不能终老 submitted on 2019-12-11 11:46:58
Question: I'm trying to wrap my head around NEON intrinsics, and figured I could start with an example and ask some questions. In this experiment I want to convert 32-bit RGB to 16-bit BGR. What would be a good start in converting the following code to use NEON intrinsics? The problem I'm having is that 16-bit doesn't match any intrinsic that I can see. There's 16x4, 16x8, etc., but I'm having little luck working out how to approach this. Any tips? Here's the code I'm
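Since the asker's own code is cut off in this excerpt, the following is only a rough sketch under stated assumptions: the source pixels are taken to be four bytes each in R, G, B, A memory order, and the target is taken to be a 5-6-5 layout with blue in the top bits. The function name rgba8888_to_bgr565 is hypothetical. It illustrates how the "16x8" types fit: one uint16x8_t holds eight 16-bit output pixels, built from the de-interleaved 8-bit channels. A scalar path is used when NEON is unavailable and for leftover pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper. Assumes src holds R,G,B,A bytes per pixel and that the
   wanted output is BGR565: blue in bits 15..11, green 10..5, red 4..0. */
static void rgba8888_to_bgr565(const uint8_t *src, uint16_t *dst, size_t pixels)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON)
    for (; i + 8 <= pixels; i += 8) {
        /* De-interleave 8 pixels: val[0]=R, val[1]=G, val[2]=B, val[3]=A. */
        uint8x8x4_t px = vld4_u8(src + 4 * i);
        /* Widen blue to 16 bits, placing it in bits 15..8. */
        uint16x8_t out = vshll_n_u8(px.val[2], 8);
        /* Keep the top 5 bits (blue), insert green below them. */
        out = vsriq_n_u16(out, vshll_n_u8(px.val[1], 8), 5);
        /* Keep the top 11 bits (blue, green), insert red in the low 5 bits. */
        out = vsriq_n_u16(out, vshll_n_u8(px.val[0], 8), 11);
        vst1q_u16(dst + i, out);
    }
#endif
    /* Scalar path: whole buffer without NEON, or the tail with NEON. */
    for (; i < pixels; ++i) {
        uint8_t r = src[4 * i], g = src[4 * i + 1], b = src[4 * i + 2];
        dst[i] = (uint16_t)(((b >> 3) << 11) | ((g >> 2) << 5) | (r >> 3));
    }
}
```

If the real input or output layout differs, only the channel indices and shift amounts should need to change.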

How to set optimization level for a specific file in Android NDK?

≡放荡痞女 submitted on 2019-12-11 11:06:39
Question: I have a native library for Android in which some files include NEON assembly code. I inherited this code from another developer, and my knowledge of NEON assembly (or any assembly, for that matter) is skimpy, to say the least. Anyhow, I've noticed the following problem: when I compile with 'ndk-build NDK_DEBUG=1', all is fine. When I compile for release with 'ndk-build NDK_DEBUG=0', the compiler optimizes away the assembly code. I've managed to work around the problem

Find min and position of the min element in uint8x8_t neon register

折月煮酒 submitted on 2019-12-11 09:49:15
Question: Consider this piece of code:

    uint8_t v[8] = { ... };
    int ret = 256;
    int ret_pos = -1;
    for (int i=0; i<8; ++i) {
        if (v[i] < ret) {
            ret = v[i];
            ret_pos = i;
        }
    }

It finds the minimum element and its position (ret and ret_pos). With ARM NEON I could use pairwise min to find the minimum element in v, but how do I find the position of the minimum element? Update: see my own answer; what would you suggest to improve it?

Answer 1: Pairwise min will allow you to compare between 2 vectors, to find the minimum between each
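The asker's own answer and the rest of Answer 1 are not included in this excerpt, so the following is only one possible sketch of the pairwise-min idea, not the posted solution. The function name min_and_pos and the lane_ids table are hypothetical. The minimum is broadcast to every lane with three vpmin_u8 steps; lanes that equal it keep their lane index, all other lanes become 0xFF, and a second round of pairwise mins picks the smallest surviving index. Without NEON the function falls back to the loop from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes the smallest byte of v[0..7] and the index of
   its first occurrence. */
static void min_and_pos(const uint8_t v[8], int *min_out, int *pos_out)
{
    assert(v != NULL && min_out != NULL && pos_out != NULL);
#if defined(__ARM_NEON)
    static const uint8_t lane_ids[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    uint8x8_t x = vld1_u8(v);

    /* Three pairwise-min steps leave the global minimum in every lane. */
    uint8x8_t m = vpmin_u8(x, x);
    m = vpmin_u8(m, m);
    m = vpmin_u8(m, m);

    /* Lanes equal to the minimum keep their index; all others become 0xFF. */
    uint8x8_t is_min = vceq_u8(x, m);
    uint8x8_t cand = vbsl_u8(is_min, vld1_u8(lane_ids), vdup_n_u8(0xFF));

    /* The smallest surviving index is the first position of the minimum. */
    cand = vpmin_u8(cand, cand);
    cand = vpmin_u8(cand, cand);
    cand = vpmin_u8(cand, cand);

    *min_out = vget_lane_u8(m, 0);
    *pos_out = vget_lane_u8(cand, 0);
#else
    /* Scalar fallback: the loop from the question. */
    int ret = 256, ret_pos = -1;
    for (int i = 0; i < 8; ++i) {
        if (v[i] < ret) {
            ret = v[i];
            ret_pos = i;
        }
    }
    *min_out = ret;
    *pos_out = ret_pos;
#endif
}
```

Both paths report the first occurrence when the minimum appears more than once, matching the strict comparison in the original loop.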

Intrinsics Neon Swap elements in vector

给你一囗甜甜゛ submitted on 2019-12-11 05:07:55
Question: I would like to optimize the following code with NEON intrinsics. Given the input 0 1 2 3 4 5 6 7 8, it produces the output 2 1 0 5 4 3 8 7 6:

    void func(uint8_t* src, uint8_t* dst, int size){
        for (int i = 0; i < size; i++){
            dst[0] = src[2];
            dst[1] = src[1];
            dst[2] = src[0];
            dst = dst+3;
            src = src+3;
        }
    }

The only way I can think of is to use uint8x8x3_t src = vld3_u8(src); to get 3 vectors and then access every single element from src[2], src[1], src[0] and write it to memory. Can someone
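Building on the vld3_u8 idea mentioned in the question, a minimal sketch (not taken from any posted answer, which this excerpt does not include) could swap whole vectors rather than individual elements: vld3_u8 de-interleaves eight triplets into three vectors, the first and third vectors are exchanged, and vst3_u8 re-interleaves them on the way out. The function name swap_triplets is hypothetical, and the count is assumed to be a number of 3-byte groups, as in the original loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reverses the byte order inside each 3-byte group.
   triplets is the number of 3-byte groups, like size in the question. */
static void swap_triplets(const uint8_t *src, uint8_t *dst, int triplets)
{
    int i = 0;
    assert(src != NULL && dst != NULL && triplets >= 0);
#if defined(__ARM_NEON)
    for (; i + 8 <= triplets; i += 8) {
        /* val[0] holds bytes 0,3,6,...; val[1] bytes 1,4,7,...; val[2] bytes 2,5,8,... */
        uint8x8x3_t t = vld3_u8(src + 3 * i);
        uint8x8_t first = t.val[0];
        t.val[0] = t.val[2];
        t.val[2] = first;
        /* Re-interleave and store 24 bytes. */
        vst3_u8(dst + 3 * i, t);
    }
#endif
    /* Scalar path: whole buffer without NEON, or the tail with NEON. */
    for (; i < triplets; ++i) {
        dst[3 * i + 0] = src[3 * i + 2];
        dst[3 * i + 1] = src[3 * i + 1];
        dst[3 * i + 2] = src[3 * i + 0];
    }
}
```

The sketch assumes src and dst do not overlap.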

ARM inline assembly code with error “impossible constraint in asm”

旧城冷巷雨未停 submitted on 2019-12-11 05:07:26
Question: I am trying to optimize the following code in complex.cpp:

    typedef struct { float re; float im; } dcmplx;

    dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
    {
        int i;
        dcmplx z, xout;
        xout.re = xout.im = 0.0;
        asm volatile (
            "movs r3, #0\n\t"
            ".loop:\n\t"
            "vldr s11, [%[hat], #4]\n\t"
            "vldr s13, [%[hat]]\n\t"
            "vneg.f32 s11, s11\n\t"
            "vldr s15, [%[buf], #4]\n\t"
            "vldr s12, [%[buf]]\n\t"
            "vmul.f32 s14, s15, s13\n\t"
            "vmul.f32 s15, s11, s15\n\t"
            "adds %[hat], #8\n\t"
            "vmla.f32 s14, s11, s12\n\t"

Testing NEON SIMD registers for equality over all lanes

巧了我就是萌 submitted on 2019-12-11 04:34:54
Question: I'm using NEON intrinsics with clang. I want to test two uint32x4_t SIMD values for equality over all lanes: not 4 test results, but one single result that tells me whether A and B are equal in every lane. On Intel AVX, I would use something like:

    _mm256_testz_si256( _mm256_xor_si256( A, B ), _mm256_set1_epi64x( -1 ) )

What would be a good way to perform an all-lane equality test with NEON SIMD? I am assuming I will need intrinsics that operate across lanes. Does ARM NEON have those features?
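No answer is included in this excerpt, so here is one possible sketch rather than an accepted solution. It compares the lanes with vceqq_u32, ANDs the two 64-bit halves of the mask together, and folds the remaining two lanes with a pairwise min, so the result is all-ones only if every lane matched. The function name all_lanes_equal is hypothetical, and it takes plain arrays so that the same code also builds, via a scalar fallback, where NEON is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns 1 if a[k] == b[k] for all four lanes, else 0. */
static int all_lanes_equal(const uint32_t a[4], const uint32_t b[4])
{
    assert(a != NULL && b != NULL);
#if defined(__ARM_NEON)
    /* Each lane becomes 0xFFFFFFFF where equal, 0 where different. */
    uint32x4_t eq = vceqq_u32(vld1q_u32(a), vld1q_u32(b));
    /* Fold 4 lanes to 2: a lane stays all-ones only if both halves matched. */
    uint32x2_t half = vand_u32(vget_low_u32(eq), vget_high_u32(eq));
    /* Fold 2 lanes to 1: the pairwise min is all-ones only if both are. */
    half = vpmin_u32(half, half);
    return vget_lane_u32(half, 0) == 0xFFFFFFFFu;
#else
    /* Scalar fallback with the same result. */
    for (int k = 0; k < 4; ++k) {
        if (a[k] != b[k]) {
            return 0;
        }
    }
    return 1;
#endif
}
```

On AArch64 the across-lane reduction vminvq_u32 applied to the comparison mask should give the same answer in a single step; the version above sticks to intrinsics that also exist on 32-bit ARMv7.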

Is numpy optimized for raspberry-pi automatically

大兔子大兔子 submitted on 2019-12-10 20:01:42
Question: The Raspberry Pi (armv7l architecture) has NEON and VFPv4 support, which can be used for optimization. Does the standard version of numpy include these optimizations when it is installed with pip3 install numpy or apt-get install python3-numpy? I am not talking about BLAS and LAPACK, just native numpy.

Answer 1: As Mark Setchell noted, numpy does not appear to have specific code that targets NEON intrinsics. However, that is not the full story. Modern compilers are frequently able to take serially written code

Porting ARM NEON code to AARCH64, many questions

你说的曾经没有我的故事 submitted on 2019-12-10 17:53:55
Question: I'm porting some ARM NEON code to 64-bit ARMv8, but I can't find good documentation for it. Many features seem to be gone, and I don't know how to implement the same functions without them. So, the general question is: where can I find a complete reference for the new SIMD implementation, including an explanation of how to do the same simple tasks that are covered in the many ARM NEON tutorials? Some questions about particular features: 1 - How do I load a value in all the lane of