neon

C versus vDSP versus NEON - How could NEON be as slow as C?

戏子无情 posted on 2019-12-05 19:44:18
How could NEON be as slow as C? I have been trying to build a fast histogram function that buckets incoming values into ranges by assigning each one the range threshold it is closest to. This would be applied to images, so it has to be fast (assume an image array of 640x480, so roughly 300,000 elements). The histogram range numbers are multiples of 25 (0, 25, 50, 75, 100). Inputs would be floats and the final outputs would be integers. I tested the following versions in Xcode by opening a new empty project (no app delegate) and just using the main.m file. I
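For reference, the bucketing rule described in the question can be sketched in plain C. This is a minimal scalar baseline, not the poster's code: the names `nearest_threshold` and `bucket_values` are hypothetical, and the clamp to the range 0 to 100 is an assumption based on the thresholds listed.

```c
#include <stddef.h>

/* Hypothetical helper: snap x to the nearest multiple of 25 in [0, 100].
 * Assumption: values outside the range clamp to the end thresholds.
 * NaN inputs are not handled. */
int nearest_threshold(float x)
{
    if (x <= 0.0f)   return 0;
    if (x >= 100.0f) return 100;
    /* Adding 0.5 before truncation rounds to the nearest step
     * (exact halves round up). */
    return 25 * (int)(x / 25.0f + 0.5f);
}

/* Apply the rule to n floats, writing integer thresholds to out. */
void bucket_values(const float *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = nearest_threshold(in[i]);
}
```

A branch-free form of the same arithmetic (scale, add 0.5, convert, clamp) is the usual starting point for a SIMD version.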

determinant calculation with SIMD

▼魔方 西西 posted on 2019-12-05 17:18:38
Is there an approach for calculating the determinant of low-dimensional matrices (about 4x4) that works well with SIMD (NEON, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to SSE3 and NEON, both under Linux. The matrix elements are all floats. Here's my 5 cents. Determinant of a 2x2 matrix: that's an exercise for the reader and should be simple to implement. Determinant of a 3x3 matrix: use the scalar triple product. This will require smart cross() and dot() implementations. The recipes for these are widely available.
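The scalar triple product mentioned in the answer can be written out as follows. This is a minimal scalar sketch; the `vec3` type and the names `vec3_cross`, `vec3_dot` and `det3` are illustrative and not from the original post. The rows of the matrix are passed as three vectors.

```c
/* Hypothetical 3-component vector type for illustration. */
typedef struct { float x, y, z; } vec3;

/* Cross product a x b. */
vec3 vec3_cross(vec3 a, vec3 b)
{
    vec3 r = {
        a.y * b.z - a.z * b.y,
        a.z * b.x - a.x * b.z,
        a.x * b.y - a.y * b.x
    };
    return r;
}

/* Dot product a . b. */
float vec3_dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Determinant of the 3x3 matrix with rows r0, r1, r2:
 * det = r0 . (r1 x r2), the scalar triple product. */
float det3(vec3 r0, vec3 r1, vec3 r2)
{
    return vec3_dot(r0, vec3_cross(r1, r2));
}
```

The cross product maps well to SIMD shuffles and multiplies, which is why this form tends to vectorize better than a fully hand-expanded formula.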

Efficiently compute max of an array of 8 elements in arm neon

时光毁灭记忆、已成空白 posted on 2019-12-05 14:26:13
How do I find the max element in an array of 8 bytes, 8 shorts or 8 ints? I may need just the position of the max element, the value of the max element, or both. For example: unsigned FindMax8(const uint32_t src[8]) // returns position of max element { unsigned ret = 0; for (unsigned i=0; i<8; ++i) { if (src[i] > src[ret]) ret = i; } return ret; } At -O2 clang unrolls the loop but does not use NEON, which should give a decent performance boost (because it eliminates many data-dependent branches?). For 8 bytes and 8 shorts the approach should be simpler, as the entire array can be loaded into a single q-register
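For the value-only case with 8 ints, one possible NEON formulation is sketched below using intrinsics from `arm_neon.h`. The function name `max8_u32` is hypothetical, and a scalar fallback is included so the snippet also builds on non-ARM targets. Finding the position as well needs extra work that this sketch does not attempt.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the largest of 8 unsigned 32-bit values. */
uint32_t max8_u32(const uint32_t src[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Element-wise max of the two halves leaves 4 candidates. */
    uint32x4_t m = vmaxq_u32(vld1q_u32(src), vld1q_u32(src + 4));
    /* Pairwise max reduces 4 candidates to 2, then 2 to 1. */
    uint32x2_t p = vpmax_u32(vget_low_u32(m), vget_high_u32(m));
    p = vpmax_u32(p, p);
    return vget_lane_u32(p, 0);
#else
    /* Scalar fallback for targets without NEON. */
    uint32_t best = src[0];
    for (unsigned i = 1; i < 8; ++i)
        if (src[i] > best)
            best = src[i];
    return best;
#endif
}
```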

ARM NEON assembly on Windows Phone 8 not working

纵饮孤独 posted on 2019-12-05 11:25:16
I'm trying to call a function that is coded in ARM NEON assembly in an .s file that looks like this: AREA myfunction, code, readonly, ARM global fun align 4 fun push {r4, r5, r6, r7, lr} add r7, sp, #12 push {r8, r10, r11} sub r4, sp, #64 bic r4, r4, #15 mov sp, r4 vst1.64 {d8, d9, d10, d11}, [r4]! vst1.64 {d12, d13, d14, d15}, [r4] [....] and I'm assembling it like this: armasm.exe -32 func.s func.obj Unfortunately this doesn't work, and I get an illegal instruction exception when I try to call the function. When I used dumpbin.exe to disassemble the .obj, it seems to be disassembling as

November NEO Technical Community Development Progress Roundup

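我们两清 posted on 2019-12-05 10:11:21
To help everyone follow the development progress of the technical community on the NEO platform, NEONewsToday will publish a monthly report of noteworthy updates. These reports will cover contributions to NEO core projects as well as improvements to community-created projects. This report is not an exhaustive list of all project progress. NEONewsToday gathers information from as many community contributors as possible, but cannot cover every community project. Any NEO developer who has made a significant contribution to NEO infrastructure or development tools (whether a member of a development community or not) can contact NEONews Today at wakeup@neonewstoday.com and provide relevant information for use in future reports. NEO protocol contributions Neo-cli (NR) Since October 24, NeoResearch members Igor and Vitor Coelho have been working on a major update aimed at optimizing the NEO consensus mechanism. The proposal has also seen contributions and comments from CoZ and NGD members. https://github.com/neo-project/neo/pull/426 The first part of this update is covered by PR #426, which focuses on adding a "commit" phase to consensus to prevent the "fork" problem (ported to be compatible with the Akka model), along with a regeneration strategy and other performance optimizations. The purpose of the regeneration strategy is to let lost or failed consensus nodes reconnect to the network automatically, without a restart. https://github.com/neo-project/neo/pull/422 Initial modifications have been completed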

Sum all elements in a quadword vector in ARM assembly with NEON

我只是一个虾纸丫 posted on 2019-12-05 09:25:09
I'm rather new to assembly, and although the ARM Information Center is often helpful, sometimes the instructions can be a little confusing to a newbie. Basically, what I need to do is sum 4 float values in a quadword register and store the result in a single-precision register. I think the instruction VPADD can do what I need, but I'm not quite sure. It seems that you want to get the sum of an array of a certain length, and not only four float values. In that case, your code will work, but it is far from optimized: it has many pipeline interlocks and an unnecessary 32-bit addition per iteration. Assuming the
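The four-float horizontal sum asked about above can be sketched with intrinsics instead of hand-written assembly (the question itself is about assembly, so this is a swapped-in illustration). The function name `hsum4` is hypothetical; `vpadd_f32` is the intrinsic form of VPADD.F32. A scalar fallback keeps the snippet buildable on non-ARM targets.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum the 4 floats at p. */
float hsum4(const float *p)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t v = vld1q_f32(p);                  /* q register: p0 p1 p2 p3 */
    /* Add the two d halves: (p0+p2, p1+p3). */
    float32x2_t s = vadd_f32(vget_low_f32(v), vget_high_f32(v));
    /* Pairwise add folds the remaining two lanes into lane 0. */
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
#else
    /* Scalar fallback, same pairing order as the NEON path. */
    return (p[0] + p[2]) + (p[1] + p[3]);
#endif
}
```

When summing a long array, it is usually better to accumulate into a full q register inside the loop and perform this reduction only once at the end.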

How to enable Neon instruction in Xcode

℡╲_俬逩灬. posted on 2019-12-05 09:04:40
I want to use NEON SIMD instructions on the iPhone. I heard we have to put the flags "-mfloat-abi=softfp -mfpu=neon" in the "Other C Flags" field of the Target inspector, but when building I get "error: unrecognized command line option "-mfpu=neon"". Is there anything else special that has to be done to allow this flag? (I have Xcode 3.2.1 and iPhone SDK 3.1.3.) Thanks! The NEON set is an extension on the Cortex-A series and is therefore not supported on the iPhone 3G. You probably cannot specify this directly. NEON is enabled by default. The target has to be ARMv7 for that (3GS or later). In order to

NEON vectorize sum of products of unsigned bytes: (a[i]-int1) * (b[i]-int2)

一笑奈何 posted on 2019-12-05 02:05:16
Question: I need to improve a loop, because it is called by my application thousands of times. I suppose I need to do it with NEON, but I don't know where to begin. Assumptions / pre-conditions: w is always 320 (a multiple of 16/32). pa and pb are 16-byte aligned. ma and mb are positive. int whileInstruction (const unsigned char *pa,const unsigned char *pb,int ma,int mb,int w) { int sum=0; do { sum += ((*pa++)-ma)*((*pb++)-mb); } while(--w); return sum; } This attempt at vectorizing it is not working well,
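One way the loop above might be vectorized is sketched below with NEON intrinsics. It is not the poster's attempt, which is cut off in the excerpt. The name `sum_of_products` is hypothetical, and the sketch assumes `ma` and `mb` fit in the range 0 to 255 so that every difference fits in a signed 16-bit lane. A scalar loop handles any tail and non-ARM builds.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum of (pa[i]-ma)*(pb[i]-mb) for i in [0, w).
 * Assumption: 0 <= ma, mb <= 255, so differences fit in int16_t. */
int sum_of_products(const unsigned char *pa, const unsigned char *pb,
                    int ma, int mb, int w)
{
    int sum = 0;
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t vma = vdupq_n_s16((int16_t)ma);
    const int16x8_t vmb = vdupq_n_s16((int16_t)mb);
    int32x4_t acc = vdupq_n_s32(0);
    for (; i + 8 <= w; i += 8) {
        /* Widen 8 bytes to 16 bits, then subtract the offsets. */
        int16x8_t a = vsubq_s16(
            vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pa + i))), vma);
        int16x8_t b = vsubq_s16(
            vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pb + i))), vmb);
        /* Widening multiply-accumulate into four 32-bit lanes. */
        acc = vmlal_s16(acc, vget_low_s16(a), vget_low_s16(b));
        acc = vmlal_s16(acc, vget_high_s16(a), vget_high_s16(b));
    }
    /* Horizontal add of the four partial sums. */
    int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    s = vpadd_s32(s, s);
    sum = vget_lane_s32(s, 0);
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < w; ++i)
        sum += (pa[i] - ma) * (pb[i] - mb);
    return sum;
}
```

With w = 320 each 32-bit lane accumulates at most 160 products of magnitude below 65,026, so the accumulator cannot overflow under the stated assumption.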

ARM NEON vectorization failure

半腔热情 posted on 2019-12-04 23:47:18
Question: I would like to enable NEON vectorization on my ARM Cortex-A9, but I get this output at compile time: "not vectorized: relevant stmt not supported: D.14140_82 = D.14143_77 * D.14141_81" Here is my loop: void my_mul(float32_t * __restrict data1, float32_t * __restrict data2, float32_t * __restrict out){ for(int i=0; i<SIZE*4; i+=1){ out[i] = data1[i]*data2[i]; } } And the options used at compile time: -march=armv7-a -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -mvectorize-with-neon
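GCC's documentation for `-mfpu` notes that on 32-bit ARM the auto-vectorizer does not vectorize floating-point operations for NEON unless `-funsafe-math-optimizations` is also given, because NEON does not fully implement IEEE 754. Whether that is the cause here cannot be confirmed from the excerpt. Independently of the compiler flags, the same loop can be written with explicit intrinsics. The sketch below uses a hypothetical name `mul_arrays` and takes the length as a parameter `n` instead of the undefined `SIZE`. A scalar loop covers the tail and non-ARM builds.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for i in [0, n). */
void mul_arrays(const float *restrict a, const float *restrict b,
                float *restrict out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration in one q register. */
    for (; i + 4 <= n; i += 4)
        vst1q_f32(out + i, vmulq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```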

ARM NEON: How to implement a 256bytes Look Up table

陌路散爱 posted on 2019-12-04 21:26:13
Question: I am porting some code I wrote to NEON using inline assembly. One of the things I need is to convert byte values in the range [0..128] to other byte values from a table that spans the full range [0..255]. The table is short, but the math behind it is not easy, so I think it is not worth calculating each time "on the fly". So I want to try lookup tables. I have used VTBL for a 32-byte case, and it works as expected. For the full range, one idea would be to first compare the range where the source is
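A different approach from the range comparison the question starts to describe is to chain one VTBL with seven VTBX lookups, subtracting 32 from the indices between steps. VTBL writes zero for out-of-range indices and VTBX leaves the destination lane unchanged, so each lane is filled exactly once. The sketch below uses intrinsics in place of inline assembly; the name `lut256` is hypothetical. A scalar loop covers the tail and non-ARM builds.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = table[src[i]] for i in [0, n). */
void lut256(const uint8_t table[256], const uint8_t *src,
            uint8_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Split the table into eight 32-byte blocks of four d registers. */
    uint8x8x4_t t[8];
    for (int k = 0; k < 8; ++k)
        for (int j = 0; j < 4; ++j)
            t[k].val[j] = vld1_u8(table + 32 * k + 8 * j);
    const uint8x8_t step = vdup_n_u8(32);
    for (; i + 8 <= n; i += 8) {
        uint8x8_t idx = vld1_u8(src + i);
        /* Block 0: lanes with idx >= 32 become 0 for now. */
        uint8x8_t r = vtbl4_u8(t[0], idx);
        for (int k = 1; k < 8; ++k) {
            /* After k subtractions only indices from block k fall
             * in 0..31; all others wrap to values >= 32. */
            idx = vsub_u8(idx, step);
            r = vtbx4_u8(r, t[k], idx);
        }
        vst1_u8(dst + i, r);
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; ++i)
        dst[i] = table[src[i]];
}
```

A full 256-byte table occupies all 32 d registers on ARMv7, so the compiler is likely to spill some of it. If the inputs really stay within [0..128], five blocks are enough.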