Can someone explain the performance behavior of the following memory-allocating C program?

半阙折子戏 2021-02-01 08:36

On my machine, Time A and Time B swap depending on whether A is defined (which changes the order in which the two callocs are called).

I

5 answers
  •  一生所求
    2021-02-01 08:42

    Short Answer

    The first time that calloc is called, it explicitly zeroes out the memory. The next time it is called, it assumes that the memory returned from mmap is already zeroed out.
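The assumption about mmap can be checked directly. The following is a minimal sketch, assuming a Linux-like system where MAP_ANONYMOUS is available; the function name anonymous_mmap_is_zeroed is hypothetical, and the mmap flags are the ones that appear in the strace output further down.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map nbytes of anonymous memory and scan it for nonzero bytes.
 * Returns 1 if every byte is zero, 0 if a nonzero byte was found,
 * and -1 if mmap itself failed. */
int anonymous_mmap_is_zeroed(size_t nbytes)
{
    assert(nbytes > 0); /* mmap rejects zero-length mappings */

    unsigned char *buf = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    int zeroed = 1;
    for (size_t i = 0; i < nbytes; i++) {
        if (buf[i] != 0) {
            zeroed = 0;
            break;
        }
    }

    munmap(buf, nbytes);
    return zeroed;
}
```

Calling anonymous_mmap_is_zeroed(1 << 20) from main should return 1 on such a system, which is why an allocator can skip the memset for freshly mapped memory.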

    Details

    Here are some of the things that I checked to come to this conclusion, which you could try yourself if you wanted:

    1. Insert a calloc call before your first ALLOC call. You will see that after this, Time A and Time B are the same.

    2. Use the clock() function to check how long each of the ALLOC calls takes. In the case where they are both using calloc, you will see that the first call takes much longer than the second one.

    3. Use the time command to measure the execution time of the calloc version and the USE_MMAP version. When I did this, I saw that the execution time for USE_MMAP was consistently slightly less.

    4. I ran with strace -tt -T, which shows both when each system call was made and how long it took. Here is part of the output:

    Strace output:

    21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
    21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
    21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>
    

    You can see that the first mmap call took 0.000014 seconds, but about 1.65 seconds elapsed before the next system call. The second mmap call took 0.000021 seconds and was followed by the times call about 120 microseconds later.
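Step 2 in the list above can be sketched as follows. This is only an illustration under stated assumptions: timed_calloc and compare_two_callocs are hypothetical helper names, not part of the original program, and whether the two timings differ depends on the libc version in use.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Allocate nbytes with calloc and return the CPU time the call took, in
 * seconds. The pointer is handed back through out so that the caller
 * decides when to free it. */
double timed_calloc(size_t nbytes, void **out)
{
    assert(out != NULL);

    clock_t start = clock();
    *out = calloc(nbytes, 1);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Time two back-to-back calloc calls of the same size and print both.
 * Both blocks stay allocated until the end, as in the original program.
 * Returns 0 on success, -1 if either allocation failed. */
int compare_two_callocs(size_t nbytes)
{
    void *a = NULL;
    void *b = NULL;

    double first = timed_calloc(nbytes, &a);
    double second = timed_calloc(nbytes, &b);

    int ok = (a != NULL && b != NULL);
    if (ok) {
        printf("first calloc:  %.6f s\n", first);
        printf("second calloc: %.6f s\n", second);
    }

    free(a); /* free(NULL) is a no-op, so this is safe on failure */
    free(b);
    return ok ? 0 : -1;
}
```

To mirror the strace run, call compare_two_callocs with a size close to the 2000015360 bytes seen in the mmap calls, provided the machine has enough memory for two such blocks.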

    I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc did not make any calls to memset. If you are interested, you can read the calloc source code in glibc (look for __libc_calloc). As for why calloc does the memset on the first call but not on subsequent ones, I don't know. But I feel fairly confident that this explains the behavior you have asked about.

    As for why the array that was zeroed with memset performs better, my guess is that it has to do with entries being loaded into the TLB rather than with the cache, since it is a very large array. Regardless, the specific reason for the performance difference that you asked about is that the two calloc calls behave differently when they are executed.
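One way to probe the guess about the zeroed array is to time a first and a second pass over freshly mapped memory. The sketch below assumes a Linux-like system; touch_pages and compare_first_and_second_touch are hypothetical names, and the 4096-byte stride is an assumed page size. It only measures the difference between the passes and does not by itself show whether the TLB, page faults, or the cache is responsible.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

/* Write one byte per stride across buf and return the CPU seconds taken. */
double touch_pages(unsigned char *buf, size_t nbytes, size_t stride)
{
    assert(stride > 0); /* a zero stride would never advance the loop */

    clock_t start = clock();
    for (size_t i = 0; i < nbytes; i += stride)
        buf[i] = 1;
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Map nbytes of anonymous memory, walk it twice, and print both timings.
 * The first walk touches memory that has never been written; the second
 * walk touches memory that has already been written once.
 * Returns 0 on success, -1 if mmap failed. */
int compare_first_and_second_touch(size_t nbytes)
{
    unsigned char *buf = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    double first = touch_pages(buf, nbytes, 4096);  /* assumed page size */
    double second = touch_pages(buf, nbytes, 4096);

    printf("first pass:  %.6f s\n", first);
    printf("second pass: %.6f s\n", second);

    munmap(buf, nbytes);
    return 0;
}
```

If the second pass is consistently faster for a large nbytes, that is at least consistent with the idea that memory already written by memset is cheaper to walk afterwards.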
