cortex-a8 | 易学教程

ELF loading when VMA != LMA

Question: I have a problem with this one. I am using an ARM Cortex-A9 with DS-5 to create bare-metal firmware. I modified my linker file to intentionally place the .data section's LMA adjacent to the text and rodata sections, because its default run-time VMA is located 1 MB away, which makes the .bin image around 1 MB in size while containing 90% zeroes. So I intentionally made LMA != VMA to save space. I also added code in start.S that relocates the .data section from its LMA to its VMA. However on loading the resulting ELF
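The relocation step described above can be sketched in C rather than in start.S assembly. This is only a minimal illustration under stated assumptions: the function and parameter names are hypothetical, and on a real target the three pointers would come from symbols exported by the linker script rather than from ordinary arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy an initialized-data image from its load address (LMA) to its
 * run-time address (VMA), one 32-bit word at a time.
 * All names are hypothetical; on a real target the pointers would be
 * linker-script symbols marking the load start, run start and run end.
 */
static void relocate_data(const uint32_t *load_start,
                          uint32_t *run_start,
                          uint32_t *run_end)
{
    assert(run_start <= run_end);   /* an empty section is allowed */

    if (load_start == run_start) {
        return;                     /* LMA == VMA: nothing to move */
    }
    while (run_start < run_end) {
        *run_start++ = *load_start++;
    }
}
```

A copy like this has to run before any code reads an initialized global, which is why it usually sits early in the startup path.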

Efficient floating point comparison (Cortex-A8)

Question: There is a big (~100 000 elements) array of floating-point variables, and there is a threshold (also floating point). The problem is that I have to compare each variable from the array with the threshold, but the NEON flags transfer takes a really long time (~20 cycles according to a profiler). Is there any efficient way to compare these values? NOTE: As rounding error doesn't matter, I tried the following:

    float arr[10000];
    float threshold;
    ....
    int a = arr[20]; // e.g.
    int t = threshold;
    if (t > a
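Note that the assignments in the quoted snippet convert each float to an integer by value, so the fractional part is dropped. A related integer-side trick, shown here only as a hedged sketch and not as the asker's method, compares the raw IEEE-754 bit patterns instead. It assumes 32-bit floats and is valid only when both values have a clear sign bit and neither is NaN. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a signed 32-bit integer. */
static int32_t float_bits(float f)
{
    int32_t bits;
    assert(sizeof bits == sizeof f);   /* assumes a 32-bit float */
    memcpy(&bits, &f, sizeof bits);    /* well-defined, unlike a pointer cast */
    return bits;
}

/*
 * Returns 1 if a > b, using only an integer comparison.
 * Valid only when both inputs are non-negative (sign bit clear) and not NaN,
 * because for such values the bit patterns sort in the same order as the floats.
 */
static int greater_nonneg(float a, float b)
{
    return float_bits(a) > float_bits(b);
}
```

For example, 0.75f and 0.25f both become 0 under a by-value conversion, while their bit patterns still compare in the expected order.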

Why u-boot can put global data's address into r9 register?

Question: When I looked through the U-Boot source code, I found that it passes global data through the r9 register like this: register volatile gd_t *gd asm ("r9") So I'm curious: how does U-Boot ensure that later code won't use the r9 register and corrupt the global data? Is there an option to tell the compiler not to use a specific register? Answer 1: From the Procedure Call Standard for the ARM Architecture: The role of register r9 is platform specific. A virtual platform may assign any role to this register and must document this

Optimizing Cortex-A8 color conversion using NEON

Question: I am currently writing a color conversion routine to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses.

    void convert_hd(uint8_t *orig, uint8_t *result) {
        uint32_t width = 1280;
        uint32_t height = 720;
        uint8_t *lineOdd = orig;
        uint8_t *lineEven = orig + width*2;
        uint8_t *resultYOdd = result;
        uint8_t *resultYEven = result + width;
        uint8_t *resultUV = result + height*width;
        uint32_t totalLoop = height/2;
        while
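For checking an optimized routine like the one quoted above, a plain scalar reference can be useful. The sketch below is not the asker's function: it is a hypothetical, unoptimized version that assumes packed YUY2 input (Y0 U Y1 V), an NV12 output with the interleaved UV plane directly after the Y plane, and chroma taken from the first line of each line pair. Width and height are assumed to be even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar reference: packed YUY2 (Y0 U Y1 V) to NV12 (Y plane followed by
 * an interleaved UV plane at half vertical resolution).
 * Assumption: chroma is copied from the first line of each line pair.
 */
static void yuy2_to_nv12_ref(const uint8_t *src, uint8_t *dst,
                             uint32_t width, uint32_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    uint8_t *y_plane = dst;
    uint8_t *uv_plane = dst + (size_t)width * height;

    for (uint32_t row = 0; row < height; row++) {
        const uint8_t *line = src + (size_t)row * width * 2;
        uint8_t *y_out = y_plane + (size_t)row * width;
        uint8_t *uv_out = uv_plane + (size_t)(row / 2) * width;

        for (uint32_t x = 0; x < width; x += 2) {
            y_out[x]     = line[2 * x];           /* Y0 */
            y_out[x + 1] = line[2 * x + 2];       /* Y1 */
            if (row % 2 == 0) {
                uv_out[x]     = line[2 * x + 1];  /* U */
                uv_out[x + 1] = line[2 * x + 3];  /* V */
            }
        }
    }
}
```

Comparing the output of a NEON version against this reference on a few frames is one way to confirm that an optimization did not change the result.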

[ARM CortexA]Difference between Strongly-ordered and Device Memory Type

Question: I am a complete beginner with Cortex-A. I am aware that ARM uses a weakly-ordered memory model, and that there are three mutually exclusive memory types: Strongly-ordered, Device, and Normal. I roughly understand what Normal is for and what Strongly-ordered and Device mean. However, the difference between Strongly-ordered and Device is confusing to me. According to the Cortex-A Series Programmer's Guide, the only difference is that: A write to Strongly-ordered memory can complete only when it reaches the peripheral or memory component accessed by the write. A write to Device memory is permitted to

How to get call graph profiling working with gcc compiled code and ARM Cortex A8 target?

I am struggling with this one... I need to do profiling on an ARM board and need to view call graphs. I tried OProfile, kernel perf, and Google performance tools. All work fine but do not output any call-graph information. This led me to the conclusion that I am not compiling my code correctly. I use the following flags when compiling my C++ code:

Arch specific: -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=vfpv3
General: -fexceptions -fno-strict-aliasing -D_REENTRANT -Wall -Wextra
Debugging (with optimization): -O2 -g -fno-omit-frame-pointer

I did a lot of Google searching

Using ARM NEON intrinsics to add alpha and permute

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?

    void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix) {
        numPix /= 8; //process 8 pixels at a time
        uint8x8_t alpha = vdup_n_u8 (0xff);
        for (int i=0; i<numPix; i++) {
            uint8x8x3_t rgb = vld3_u8 (src);
            uint8x8x4_t bgra;
            bgra.val[0] = rgb.val[2]; //these lines are slow
            bgra.val[1] = rgb.val[1]; //these lines are slow
            bgra.val[2] = rgb.val[0]; //these lines are slow
            bgra.val[3] =
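The quoted function is cut off, so the following is only a hedged sketch of how a complete version might look, not a reconstruction of the asker's code. It assumes the alpha lane is filled from the duplicated 0xff vector and that the result is written with `vst4_u8`. It also adds a scalar loop that handles leftover pixels and serves as a fallback on targets without NEON. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/*
 * Convert packed RGB to BGRA with opaque alpha.
 * Hypothetical sketch: NEON handles blocks of 8 pixels when available,
 * the scalar loop handles the tail (or everything on non-NEON targets).
 */
static void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;

#if defined(__ARM_NEON)
    const uint8x8_t alpha = vdup_n_u8(0xff);
    for (; i + 8 <= num_pixels; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* de-interleave R, G, B */
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];                /* B */
        bgra.val[1] = rgb.val[1];                /* G */
        bgra.val[2] = rgb.val[0];                /* R */
        bgra.val[3] = alpha;                     /* A (assumed opaque) */
        vst4_u8(dst + 4 * i, bgra);              /* interleave B, G, R, A */
    }
#endif

    for (; i < num_pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Indexing from `i` instead of dividing the pixel count up front keeps the pointer arithmetic explicit and lets pixel counts that are not a multiple of 8 work without a separate code path.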

ARM Cortex A8 PMNC read gives 0 after enabling also.. Any Idea/Suggestions?

Question:

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("user-mode access to performance registers");

    int __init arm_init(void) {
        unsigned int value;
        /* enable user-mode access */
        printk(KERN_INFO "enable user-mode access\n");
        asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));
        /* Reading the value here--just to check */
        asm ("MRC p15, 0, %0, c9, c14, 0\t\n": "=r"(value));
        printk("value: %d\n", value);
        /* disable counter overflow interrupts (just in case)*/
        printk(KERN_INFO "disable counter overflow

Checksum code implementation for Neon in Intrinsics

Question: I'm trying to implement checksum computation code (2's complement addition) for NEON, using intrinsics. The current checksum computation is carried out on the ARM core. My implementation fetches 128 bits at once from memory into NEON registers, does the SIMD addition, and folds the result from a 128-bit number to a 16-bit number. Everything looks to be working fine, but my NEON implementation takes more time than the ARM version. ARM version takes: 0.860000 s NEON version takes:
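The accumulate-then-fold idea described above can be illustrated with a scalar reference, which is also handy for validating a SIMD version. The sketch below is an assumption about the kind of checksum involved: it treats the data as 16-bit big-endian words summed Internet-checksum style (RFC 1071), with carries folded back into the low 16 bits. The question itself does not confirm this, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar reference: sum 16-bit big-endian words into a wide accumulator,
 * then fold the carries back in until the result fits in 16 bits.
 * Assumption: an RFC 1071 style sum; the final bitwise inversion is
 * left to the caller.
 */
static uint16_t checksum_fold_ref(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint64_t sum = 0;
    size_t i = 0;

    for (; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (i < len) {
        sum += (uint32_t)data[i] << 8;   /* odd trailing byte, zero-padded */
    }
    while (sum >> 16) {
        sum = (sum & 0xffffu) + (sum >> 16);   /* end-around carry */
    }
    return (uint16_t)sum;
}
```

Because the fold happens only once at the end, a vectorized version can keep several wide partial sums in registers and combine them just before the same folding step.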