ARM performance counters vs linux clock_gettime


Question


I am using a Zynq chip on a development board (ZC702), which has a dual-core Cortex-A9 MPCore at 667 MHz and runs Linux kernel 3.3. I wanted to compare the execution time of a program, so I first used clock_gettime and then used the cycle counter provided by the ARM co-processor, which increments once per processor cycle (based on this Stack Overflow question and this).
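For anyone reproducing this: PMCCNTR must be enabled from privileged mode before it can be read. A minimal sketch of the ARMv7 enable sequence follows; the function name is mine, and this would typically live in a small kernel module:

/* Enable the PMU cycle counter (PMCCNTR); must run in privileged mode. */
static inline void enable_ccnt(void)
{
    unsigned int pmcr;

    /* PMCR: set E (enable all counters) and C (reset cycle counter). */
    __asm__ __volatile__("MRC p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    __asm__ __volatile__("MCR p15, 0, %0, c9, c12, 0" : : "r"(pmcr | 0x5));

    /* PMCNTENSET: bit 31 switches on PMCCNTR itself. */
    __asm__ __volatile__("MCR p15, 0, %0, c9, c12, 1" : : "r"(0x80000000u));

    /* PMUSERENR: allow user-space MRC reads of the PMU registers. */
    __asm__ __volatile__("MCR p15, 0, %0, c9, c14, 0" : : "r"(1u));
}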

I compile the program with the -O0 flag (since I don't want any reordering or optimization done).

The time I measure with the performance counters is 583833498 cycles / 666.666687 MHz = 875750.221 microseconds.

Using clock_gettime() (with either CLOCK_REALTIME, CLOCK_MONOTONIC, or CLOCK_MONOTONIC_RAW) the measured time is 731627.126 microseconds, which is about 150000 microseconds less.

Can anybody explain why this happens? Why is there a difference? The processor does not clock-scale, so how can clock_gettime report a shorter execution time? Sample code below:


#include <stdio.h>
#include <time.h>

#define RUNS 50000000

/* Busy loop executed RUNS times: accumulates the loop counter into r5;
   the ten "mov r4, r4" instructions are no-op padding. The "cc" clobber
   is required because cmp modifies the condition flags. */
#define BENCHMARK(val) \
    __asm__ __volatile__("mov r4, %1\n\t" \
                         "mov r5, #0\n\t" \
                         "1:\n\t" \
                         "add r5, r5, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "mov r4, r4\n\t" \
                         "sub r4, r4, #1\n\t" \
                         "cmp r4, #0\n\t" \
                         "bne 1b\n\t" \
                         "mov %0, r5\n\t" \
                         : "=r" (val) \
                         : "r" (RUNS) \
                         : "r4", "r5", "cc" \
        );

struct timespec start, stop;
unsigned int start_cycles, end_cycles, i;
int index;

clock_gettime(CLOCK_MONOTONIC_RAW, &start);
/* Read PMCCNTR, the PMU cycle counter (CP15, c9, c13, 0). */
__asm__ __volatile__("MRC p15, 0, %0, c9, c13, 0\n\t" : "=r"(start_cycles));
for (index = 0; index < 5; index++)
{
    BENCHMARK(i);
}
__asm__ __volatile__("MRC p15, 0, %0, c9, c13, 0\n\t" : "=r"(end_cycles));
clock_gettime(CLOCK_MONOTONIC_RAW, &stop);
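For reference, a minimal sketch of how the two readings convert to microseconds; the helper arithmetic is mine and assumes the 32-bit cycle counter does not wrap between the two reads:

double pmu_us   = (double)(end_cycles - start_cycles) / 666.666687;
double clock_us = (stop.tv_sec  - start.tv_sec)  * 1e6
                + (stop.tv_nsec - start.tv_nsec) / 1e3;
printf("pmu: %.3f us  clock_gettime: %.3f us\n", pmu_us, clock_us);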

Answer 1:


I found the solution. I upgraded the platform from Linux kernel 3.3.0 to 3.5, and the clock_gettime value is now similar to that of the performance counters. Apparently in 3.3.0 the frequency of the clock source was assumed to be higher than it actually is (around 400 MHz instead of half the CPU frequency, i.e. about 333 MHz), which would explain the shortfall: 875750 × 333/400 ≈ 729000 microseconds, close to the 731627 measured. Probably a porting error in the old kernel version.




Answer 2:


The POSIX clocks operate within a certain precision, which you can query with clock_getres. Check whether that 150,000 us difference is inside or outside the error margin.
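For example, the resolution can be queried with a standard POSIX call (on older glibc, link with -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;

    /* clock_getres reports the granularity of the given clock. */
    if (clock_getres(CLOCK_MONOTONIC_RAW, &res) == 0)
        printf("CLOCK_MONOTONIC_RAW resolution: %ld ns\n", (long)res.tv_nsec);
    return 0;
}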

In any case it shouldn't matter much: you should repeat your benchmark many times, not 5 times but 1000 or more. You can then get the timing of a single benchmark run as

((end + e1) - (start + e0)) / 1000, or

(end - start) / 1000 + (e1 - e0) / 1000.

If e1 and e0 are the error terms, each bounded by a small constant, your maximum measurement error is abs(e1 - e0) / 1000, which becomes negligible as the number of loops increases.
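A sketch of that scheme, reusing the BENCHMARK macro from the question (the variable names and the 1000-iteration count are illustrative):

#define LOOPS 1000

struct timespec t0, t1;
unsigned int i;
int n;

clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
for (n = 0; n < LOOPS; n++)
    BENCHMARK(i);
clock_gettime(CLOCK_MONOTONIC_RAW, &t1);

/* The error term e1 - e0 is paid once and amortized over LOOPS runs,
   so the per-run error shrinks proportionally to 1/LOOPS. */
double per_run_us = ((t1.tv_sec - t0.tv_sec) * 1e6
                   + (t1.tv_nsec - t0.tv_nsec) / 1e3) / LOOPS;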



Source: https://stackoverflow.com/questions/13474000/arm-performance-counters-vs-linux-clock-gettime
