I am trying to use the PAPI library to count cache misses. A cache-hit performance counter is not available on my hardware, which is why I am trying to determine cache hits with no
I've done some experiments using LIKWID, which is similar to PAPI, on Haswell. I found that the calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, these calls can evict many of the lines that you would otherwise expect to be in the L1. Judging by the relatively large source code of PAPI_start_counters and _internal_hl_read_cnts, the same seems to apply to PAPI, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*. I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.
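For reference, here is a minimal sketch of how RDPMC could be wrapped with GCC/Clang inline assembly on x86. The helper names read_pmc and pmc_delta are my own and do not come from PAPI or LIKWID. The sketch assumes that the counter with the given index has already been programmed by some other means and that user-mode RDPMC is enabled; if either is not the case, the instruction faults. The 48-bit width used in the example is also an assumption that should be checked with CPUID leaf 0AH.

```c
#include <stdint.h>

#if !defined(__x86_64__) && !defined(__i386__)
#error "RDPMC is an x86 instruction"
#endif

/* Hypothetical helper: read performance counter `counter` with RDPMC.
 * ECX selects the counter; the value is returned in EDX:EAX.
 * Assumes the counter is already programmed and that user-mode RDPMC
 * is permitted (otherwise the instruction raises a fault). */
static inline uint64_t read_pmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* Events in a region = value after - value before.
 * The counters are narrower than 64 bits, so the difference is masked
 * to `width` bits to stay correct across a single wrap-around. */
static inline uint64_t pmc_delta(uint64_t before, uint64_t after,
                                 unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    return (after - before) & mask;
}
```

The intended use would be to call read_pmc immediately before and immediately after the loop and pass both values to pmc_delta, so that no library call sits between the counter reads and the code being measured.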
Alternatively, you can put two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract from the results the counts for one copy of the loop. This method works well.
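The subtraction can be sketched as follows. The function name and the array layout are my own choices; the two input arrays stand for whatever PAPI_read_counters filled in for a run with one copy of the loop and for a run with two copies. The reasoning, as I see it, is that the PAPI overhead and the first copy of the loop appear in both measurements and cancel out, which leaves the counts of the second copy, the one that runs with no PAPI call right before it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: out[i] = two_copies[i] - one_copy[i].
 *   one_copy[]   counts measured with a single copy of the loop
 *   two_copies[] counts measured with two back-to-back copies
 * Counter noise can make a difference slightly negative, so the
 * result is kept signed and is not clamped. */
void isolate_second_copy(const long long *one_copy,
                         const long long *two_copies,
                         long long *out, size_t num_events)
{
    assert(one_copy && two_copies && out);
    for (size_t i = 0; i < num_events; ++i)
        out[i] = two_copies[i] - one_copy[i];
}
```

If the array really stays in the L1 between the two copies, the difference for the miss event should come out close to zero.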
By the way, the L1D.REPLACEMENT counter seems to be fairly accurate on Haswell when the number of cache lines accessed is larger than about 10. Perhaps the count would be exact when using RDPMC.
From your previous question, it seems that you're on Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM are mapped to the L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows:
L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.
LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer to the last level cache (LLC). Demand requests include loads, RFOs, and hardware prefetches from L1D, and instruction fetches from IFU.
Without getting into the details of when these events exactly occur, there are three important points here that are relevant to your question:
miss2
On Skylake, there are other native events that you can use to count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D. You can use MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There don't seem to be PAPI events for them. According to the documentation, you can pass native event codes to PAPIF_start_counters.
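As a small sketch of the arithmetic above: given the two raw counts, however they were collected, the miss count and the hit ratio follow directly. The struct and function names are illustrative only and are not part of PAPI.

```c
#include <stdint.h>

/* Illustrative result type for L1D load statistics. */
typedef struct {
    uint64_t hits;       /* MEM_LOAD_RETIRED.L1_HIT */
    uint64_t misses;     /* MEM_INST_RETIRED.ALL_LOADS - L1_HIT */
    double   hit_ratio;  /* hits / all loads, 0 if no loads retired */
} l1d_load_stats;

/* Derive misses and hit ratio from the two raw event counts. */
l1d_load_stats l1d_stats(uint64_t all_loads, uint64_t l1_hits)
{
    l1d_load_stats s;
    s.hits = l1_hits;
    /* Clamp at zero in case the two counts are slightly inconsistent
     * and the hit count ends up larger than the load count. */
    s.misses = (all_loads > l1_hits) ? all_loads - l1_hits : 0;
    s.hit_ratio = all_loads ? (double)l1_hits / (double)all_loads : 0.0;
    return s;
}
```

Both inputs need to come from the same measured region for the difference to be meaningful.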
Another issue is that it's not clear to me whether PAPIF_start_counters by default counts only user events or both kernel and user events. It seems that you can use PAPI_create_eventset to control the counting domain.
The calls to PAPI APIs can also impact the event counts. You can try to measure this using an empty block as follows:
    if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
        exit(1);
    }
    // Nothing: the measured block is intentionally empty.
    if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
        exit(1);
    }
    // values[] now holds the counts accumulated between the two reads.
This measurement will give you an estimate of the error that may occur due to PAPI itself.
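A single empty-block reading can be noisy, so one option is to repeat the empty-block measurement several times and look at the spread. This is my own addition, not something PAPI provides. The sketch below only does the bookkeeping: samples[] is assumed to hold the counts of one event from repeated empty-block runs, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of repeated empty-block readings of one event. */
typedef struct {
    long long min;
    long long max;
    double    mean;
} overhead_summary;

/* Compute min, max and mean of n empty-block samples (n must be > 0). */
overhead_summary summarize_overhead(const long long *samples, size_t n)
{
    assert(samples && n > 0);
    overhead_summary s = { samples[0], samples[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += (double)samples[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

If the gap between min and max is of the same order as the counts you expect from your loop, the loop's counts cannot be separated reliably from the PAPI overhead with this approach.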
Also, I don't think you need to use _mm_mfence.