Debugging data/NEON performance hazards in ARM NEON code
Originally the problem appeared when I tried to optimize an algorithm for ARM NEON, and a minor part of it was taking 80% of the time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performs better:

    typedef unsigned (*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);

    CalcMaxFunc CalcMaxFuncs[] =
    {
        CalcMaxFunc_NEON_0,
        CalcMaxFunc_NEON_1,
        CalcMaxFunc_NEON_2,
        CalcMaxFunc_NEON_3,
        CalcMaxFunc_C_0
    };

    int N =