I want to estimate the performance overhead due to TLB misses on an x86-64 (Intel Nehalem) machine running Linux. I would like to get this estimate from hardware performance counters.
If you can get access to a "Westmere"-based system, the performance characteristics of your code should be quite similar to what you see on the "Nehalem", but you will have access to a new hardware performance counter event that measures almost exactly what you want.
On Westmere, the best estimate of performance lost while waiting for TLB misses to be handled is probably from the hardware performance counter Event 08H, Mask 04H "DTLB_LOAD_MISSES.WALK_CYCLES", which is described as counting "Cycles Page Miss Handler is busy with a page walk due to a load miss in the Second Level TLB". This is described in "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2" (document number: 253669), available online at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
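As a rough sketch of how this event could be used on Linux: the `perf` tool accepts raw Intel events in the form `r<umask><event>` (hex), so Event 08H, Mask 04H becomes `r0408`. The helper names below (`perf_raw_event`, `walk_cycle_fraction`) and the program name `./your_program` are made up for illustration, and the sketch assumes `perf` is installed and that the event is actually implemented on your CPU.

```python
def perf_raw_event(event_select: int, umask: int) -> str:
    """Build a perf raw-event string: config = (umask << 8) | event_select."""
    return f"r{(umask << 8) | event_select:04x}"


def walk_cycle_fraction(walk_cycles: int, total_cycles: int) -> float:
    """Fraction of all core cycles during which the page miss handler was
    busy with a load page walk. Both arguments are counts read from perf."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return walk_cycles / total_cycles


# DTLB_LOAD_MISSES.WALK_CYCLES: Event 08H, Mask 04H (from the text above).
WALK_CYCLES_EVENT = perf_raw_event(0x08, 0x04)

if __name__ == "__main__":
    # Print the command to run; ./your_program is a placeholder.
    print(f"perf stat -e {WALK_CYCLES_EVENT},cycles ./your_program")
```

Dividing the walk-cycle count by the `cycles` count from the same run gives an approximate upper bound on the fraction of time lost to load TLB misses, since some walk cycles may overlap with other useful work.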
The reason this event is necessary is that TLB miss processing time is dominated by the time required to read the cache line containing the page table entry, so a count of misses alone does not tell you how much time was lost. If that cache line is in the L2 cache, then the overhead of a TLB miss will be very small (of the order of 10 cycles). If the line is in the L3 cache, then maybe 25 cycles. If the line is in memory, then ~200 cycles.
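To illustrate why a miss count alone is not enough, here is a small back-of-the-envelope model using the approximate latencies quoted above. The function name and the fractions in the example are hypothetical; on real hardware you would not normally know where the page table entries were found, which is exactly why a cycle-counting event is more useful.

```python
import math

# Approximate page-walk cost in cycles, by where the page table entry's
# cache line is found (rough figures from the text above).
PTE_LATENCY_CYCLES = {"L2": 10, "L3": 25, "memory": 200}


def expected_walk_cycles(tlb_misses: int, location_fractions: dict) -> float:
    """Estimate total page-walk cycles for a given number of TLB misses.

    location_fractions maps "L2"/"L3"/"memory" to the (assumed) fraction of
    walks whose page table entry is served from that level; must sum to 1.
    """
    if not math.isclose(sum(location_fractions.values()), 1.0):
        raise ValueError("location fractions must sum to 1")
    per_miss = sum(
        frac * PTE_LATENCY_CYCLES[level]
        for level, frac in location_fractions.items()
    )
    return tlb_misses * per_miss


if __name__ == "__main__":
    misses = 1_000_000
    # Same miss count, very different cost (fractions are made-up examples).
    print(expected_walk_cycles(misses, {"L2": 1.0}))
    print(expected_walk_cycles(misses, {"L2": 0.5, "L3": 0.3, "memory": 0.2}))
```

With identical miss counts, the second scenario costs several times more cycles than the first, so the same number of TLB misses can mean anything from a negligible to a significant slowdown.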