counting L1 cache misses with PAPI_read_counters gives unexpected results

闹比i 2021-01-06 12:22

I am trying to use the PAPI library to count cache misses. A cache hit performance counter is not available on my hardware, which is why I am trying to determine cache hits with no…

1 Answer
  • 2021-01-06 12:44

    I've done some experiments using LIKWID, which is similar to PAPI, on Haswell. I found out that the calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, this means that these functions may evict many of the lines that you would otherwise expect to be in the L1. By looking at the relatively large source code of PAPI_start_counters and _internal_hl_read_cnts, it seems to me that these functions may evict many lines from the L1, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*. I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.

    Alternatively, you can put two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract from the results the counts for one copy of the loop. This method works well.

    By the way, the L1D.REPLACEMENT counter seems to be fairly accurate on Haswell when the number of cache lines accessed is larger than about 10. Perhaps the count would be exact when using RDPMC.


    From your previous question, it seems that you're on Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM are mapped to L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows:

    L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.

    LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer to the last level cache (LLC). Demand requests include loads, RFOs, and hardware prefetches from L1D, and instruction fetches from IFU.

    Without getting into the details of when these events exactly occur, there are three important points here that are relevant to your question:

    • Both events are counted at the cache-line granularity, not x86 instruction or load uop granularities.
    • These events may occur due to the L1D hardware prefetchers. This can impact miss2.
    • There is no way to count L1D hits at the cache-line granularity for a specific physical or logical core using these events (or any other set of events on SnB-based microarchitectures).

    On Skylake, there are other native events that you can use to count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D, and MEM_INST_RETIRED.ALL_LOADS-MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There don't seem to be PAPI preset events for them, but according to the documentation, you can pass native event codes to PAPI_start_counters.
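    A sketch of how counting misses as loads minus hits might look. This assumes the legacy (pre-6.0) PAPI high-level counter API and libpfm4-style native event names, which I haven't verified against your PAPI version; treat the event name strings as placeholders to check with papi_native_avail:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int events[2];
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        exit(1);
    }
    /* Resolve the native events by name; the exact spellings below are
     * assumptions and may differ on your system. */
    if (PAPI_event_name_to_code("MEM_INST_RETIRED:ALL_LOADS", &events[0]) != PAPI_OK ||
        PAPI_event_name_to_code("MEM_LOAD_RETIRED:L1_HIT", &events[1]) != PAPI_OK) {
        fprintf(stderr, "could not resolve native event names\n");
        exit(1);
    }
    if (PAPI_start_counters(events, 2) != PAPI_OK)
        exit(1);

    /* ... code under measurement ... */

    if (PAPI_stop_counters(values, 2) != PAPI_OK)
        exit(1);
    printf("retired loads: %lld, L1 hits: %lld, L1 misses: %lld\n",
           values[0], values[1], values[0] - values[1]);
    return 0;
}
```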

    Another issue is that it's not clear to me whether PAPI_start_counters by default counts only user-mode events or both kernel- and user-mode events. It seems that you can use PAPI_create_eventset to control the counting domain.

    The calls to PAPI APIs can also impact the event counts. You can try to measure this using an empty block as follows:

    if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
        exit(1);
    }

    // Nothing.

    if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
        exit(1);
    }
    

    This measurement will give you an estimate of the error that may occur due to PAPI itself.

    Also, I don't think you need to use _mm_mfence.
