cortex-a8

ELF loading when VMA != LMA

旧街凉风 submitted on 2019-12-14 02:25:44
Question: I have a problem with this one. I am using an ARM Cortex-A9 with DS-5 to create bare-metal firmware. I modified my linker file to intentionally put the .data section's LMA adjacent to the text and rodata sections, because its default run-time VMA is located 1MB away and the .bin image is around 1MB but contains 90% zeroes. So I intentionally made LMA != VMA to save space. I also added code in start.S that relocates the .data section from its LMA to its VMA. However on loading the resulting elf
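The relocation step described in the question can be sketched in C as follows. This is only an illustration of the general copy-from-LMA-to-VMA idea, not the poster's start.S code; the function name `relocate_section` and its parameters are hypothetical, and the pointers are assumed to be word-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy an initialized section from its load address (LMA) to its
 * run-time address (VMA). Hypothetical helper: all three pointers are
 * assumed to be word-aligned, as a linker script would usually ensure. */
void relocate_section(const uint32_t *lma_start,
                      uint32_t *vma_start,
                      uint32_t *vma_end)
{
    assert(vma_start <= vma_end);
    if (lma_start == vma_start)
        return;                     /* LMA == VMA: nothing to copy */
    while (vma_start < vma_end)
        *vma_start++ = *lma_start++;
}
```

In startup code the three arguments would typically come from symbols defined in the linker script (for example a load-address symbol plus start and end symbols for .data); those symbol names depend on the script and are not shown in the excerpt.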

Efficient floating point comparison (Cortex-A8)

时光总嘲笑我的痴心妄想 submitted on 2019-12-12 08:49:12
Question: There is a big (~100 000) array of floating point variables, and there is a threshold (also floating point). The problem is that I have to compare each variable from the array with the threshold, but the NEON flags transfer takes a really long time (~20 cycles according to a profiler). Is there any efficient way to compare these values? NOTE: As rounding error doesn't matter, I tried the following: float arr[10000]; float threshold; .... int a = arr[20]; // e.g. int t = threshold; if (t > a
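One related idea is to compare the raw bit patterns of the floats with ordinary integer instructions. Note that this differs from the excerpt's `int a = arr[20];`, which converts the value (truncating it) rather than reinterpreting its bits. The sketch below is an assumption-laden illustration, not the accepted answer: the names `float_bits` and `count_above` are hypothetical, and it is only valid when every value is a non-negative, non-NaN IEEE-754 single (and not negative zero).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer.
 * memcpy avoids strict-aliasing problems. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Count the elements strictly greater than threshold using integer
 * compares only. Assumption: all inputs, including threshold, are
 * non-negative, non-NaN IEEE-754 singles (no negative zero); only then
 * does unsigned ordering of the bit patterns match float ordering. */
size_t count_above(const float *arr, size_t n, float threshold)
{
    assert(threshold >= 0.0f);
    const uint32_t t = float_bits(threshold);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += float_bits(arr[i]) > t;
    return count;
}
```

Whether this actually beats a floating-point compare on a given Cortex-A8 build would need to be measured; the sketch only shows that the comparison can stay in the integer pipeline.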

Why can U-Boot put the global data's address into the r9 register?

▼魔方 西西 submitted on 2019-12-11 13:42:04
Question: When I looked through the U-Boot source code, I found that it passes global data through the r9 register like this: register volatile gd_t *gd asm ("r9") So I'm curious: how does U-Boot ensure that later code won't use the r9 register and corrupt the global data? Is there an option to tell the compiler not to use a specific register? Answer 1: From the Procedure Call Standard for the ARM Architecture: The role of register r9 is platform specific. A virtual platform may assign any role to this register and must document this

Optimizing Cortex-A8 color conversion using NEON

社会主义新天地 submitted on 2019-12-07 05:47:54
Question: I am currently writing a color conversion routine to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses. void convert_hd(uint8_t *orig, uint8_t *result) { uint32_t width = 1280; uint32_t height = 720; uint8_t *lineOdd = orig; uint8_t *lineEven = orig + width*2; uint8_t *resultYOdd = result; uint8_t *resultYEven = result + width; uint8_t *resultUV = result + height*width; uint32_t totalLoop = height/2; while
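For reference, the data layout involved in a YUY2 to NV12 conversion can be sketched with a plain scalar routine. This is not the poster's `convert_hd` (whose loop body is cut off above) and makes no attempt at NEON or cache optimization; the function name `yuy2_to_nv12` is hypothetical, and it assumes tightly packed rows, even dimensions, and chroma taken from the first row of each row pair without averaging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar YUY2 -> NV12 reference. Assumptions: width and height are
 * even, rows are tightly packed, and the chroma for each pair of rows
 * is copied from the first row of the pair (no averaging). */
void yuy2_to_nv12(const uint8_t *src, uint8_t *dst,
                  size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    uint8_t *dst_y  = dst;                   /* width * height bytes     */
    uint8_t *dst_uv = dst + width * height;  /* width * height / 2 bytes */

    for (size_t row = 0; row < height; row++) {
        const uint8_t *in = src + row * width * 2;
        uint8_t *out_y  = dst_y + row * width;
        uint8_t *out_uv = dst_uv + (row / 2) * width;
        for (size_t x = 0; x < width; x += 2) {
            /* Each 4-byte group holds Y0 U Y1 V for two pixels. */
            out_y[x]     = in[2 * x];
            out_y[x + 1] = in[2 * x + 2];
            if (row % 2 == 0) {
                out_uv[x]     = in[2 * x + 1];  /* U */
                out_uv[x + 1] = in[2 * x + 3];  /* V */
            }
        }
    }
}
```

A routine like this can serve as a correctness baseline when comparing the output of an optimized version.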

[ARM CortexA]Difference between Strongly-ordered and Device Memory Type

…衆ロ難τιáo~ submitted on 2019-12-04 20:17:33
Question: I am new to Cortex-A. I am aware that ARM applies a weakly-ordered memory model, and that there are three mutually exclusive memory types: Strongly-ordered, Device, and Normal. I roughly understand what Normal is for and what Strongly-ordered and Device mean. However, the difference between Strongly-ordered and Device is confusing to me. According to the Cortex-A Series Programmer's Guide, the only difference is that: A write to Strongly-ordered memory can complete only when it reaches the

How to get call graph profiling working with gcc compiled code and ARM Cortex A8 target?

妖精的绣舞 submitted on 2019-12-04 09:03:26
I am biting my teeth out on this one... I need to do profiling on an ARM board and need to view call graphs. I tried OProfile, kernel perf, and Google performance tools. All work fine but do not output any call-graph information. This led me to the conclusion that I am not compiling my code correctly. I use the following flags when compiling my C++ code. Arch specific: -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=vfpv3 General: -fexceptions -fno-strict-aliasing -D_REENTRANT -Wall -Wextra Debugging (with optimization): -O2 -g -fno-omit-frame-pointer I did a lot of Google searching

Using ARM NEON intrinsics to add alpha and permute

别来无恙 submitted on 2019-12-04 08:24:05
I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components? void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix) { numPix /= 8; //process 8 pixels at a time uint8x8_t alpha = vdup_n_u8 (0xff); for (int i=0; i<numPix; i++) { uint8x8x3_t rgb = vld3_u8 (src); uint8x8x4_t bgra; bgra.val[0] = rgb.val[2]; //these lines are slow bgra.val[1] = rgb.val[1]; //these lines are slow bgra.val[2] = rgb.val[0]; //these lines are slow bgra.val[3] =
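The permutation itself can be sketched as below. This is an illustration under assumptions, not a claim about the fastest approach: the function name `rgb_to_bgra` is hypothetical, the NEON path is only compiled when `__ARM_NEON` is defined, and a scalar loop handles the tail (and the whole buffer on targets without NEON). Unlike the excerpt, it does not drop pixels when the count is not a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Convert packed RGB to packed BGRA with alpha fixed at 0xff.
 * src holds 3 * num_pix bytes, dst holds 4 * num_pix bytes. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t num_pix)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x8_t alpha = vdup_n_u8(0xff);
    for (; i + 8 <= num_pix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* de-interleave R, G, B */
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];
        bgra.val[1] = rgb.val[1];
        bgra.val[2] = rgb.val[0];
        bgra.val[3] = alpha;
        vst4_u8(dst + 4 * i, bgra);              /* interleave B, G, R, A */
    }
#endif
    /* Scalar path: remaining pixels, or everything without NEON. */
    for (; i < num_pix; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

The scalar loop doubles as a reference for checking the output of any hand-tuned variant.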

[ARM CortexA]Difference between Strongly-ordered and Device Memory Type

混江龙づ霸主 submitted on 2019-12-03 13:14:45
I am new to Cortex-A. I am aware that ARM applies a weakly-ordered memory model, and that there are three mutually exclusive memory types: Strongly-ordered, Device, and Normal. I roughly understand what Normal is for and what Strongly-ordered and Device mean. However, the difference between Strongly-ordered and Device is confusing to me. According to the Cortex-A Series Programmer's Guide, the only difference is that: A write to Strongly-ordered memory can complete only when it reaches the peripheral or memory component accessed by the write. A write to Device memory is permitted to

ARM Cortex A8 PMNC read gives 0 after enabling also.. Any Idea/Suggestions?

跟風遠走 submitted on 2019-12-02 15:03:25
Question: MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("user-mode access to performance registers"); int __init arm_init(void) { unsigned int value; /* enable user-mode access */ printk(KERN_INFO "enable user-mode access\n"); asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1)); /* Reading the value here--just to check */ asm ("MRC p15, 0, %0, c9, c14, 0\t\n": "=r"(value)); printk("value: %d\n", value); /* disable counter overflow interrupts (just in case)*/ printk(KERN_INFO "disable counter overflow

Checksum code implementation for Neon in Intrinsics

半腔热情 submitted on 2019-12-02 11:14:10
Question: I'm trying to implement the checksum computation code (2's complement addition) for NEON, using intrinsics. The current checksum computation is carried out on ARM. My implementation fetches 128 bits at once from memory into NEON registers, does the SIMD addition, and folds the result from a 128-bit number to a 16-bit number. Everything looks to be working fine, but my NEON implementation takes more time than the ARM version. ARM version takes: 0.860000 s NEON version takes:
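The accumulate-then-fold step mentioned in the question can be sketched in scalar C, which is useful as a baseline when validating a NEON version. The poster's actual code is not shown, so everything here is an assumption: the function name `checksum_fold16` is hypothetical, 16-bit words are read in big-endian order, the length is even, and folding uses end-around carry in the style of the Internet checksum (without the final complement).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum 16-bit words in a wide accumulator, then fold the result to
 * 16 bits with end-around carry. Assumptions: len is even, words are
 * big-endian, and no final bitwise complement is applied. */
uint16_t checksum_fold16(const uint8_t *data, size_t len)
{
    assert(len % 2 == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    /* Fold: add the bits above 16 back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

A SIMD version would keep several partial sums in vector lanes and apply the same fold once at the end, so its result can be compared directly against this routine.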