I'm currently profiling an application with performance problems using Valgrind's "Callgrind". In looking at the profiling data, it appears that a good 25% of processing tim
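Obtaining thread-local data can cost a lot more than it looks. Depending on the platform and the TLS implementation, it may go through a library call (pthread_getspecific, for example) or, on some systems, even a system call. A system call jumps through an interrupt vector and has to read kernel memory, and all of that kills the cache.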
For this reason, reading thread-local data can take much longer than a normal variable read. It may well be a good idea to cache thread-local data in a local variable and not make frequent accesses to thread-local storage.
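
As a rough sketch of that caching idea, assuming C++11 `thread_local` (the question does not say which TLS mechanism the application uses), the two functions below do the same work. The names `tls_counter`, `sum_direct`, `sum_cached` and `check_equivalence` are made up for illustration. The first touches thread-local storage on every loop iteration; the second works on an ordinary local variable and writes the result back once.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread accumulator, used only for illustration.
thread_local std::uint64_t tls_counter = 0;

// Accesses thread-local storage on every iteration of the loop.
std::uint64_t sum_direct(std::uint64_t iterations) {
    tls_counter = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        tls_counter += i;  // each access may have to resolve the TLS address again
    }
    return tls_counter;
}

// Works on a plain local variable and stores to thread-local storage once.
std::uint64_t sum_cached(std::uint64_t iterations) {
    std::uint64_t local = 0;  // ordinary stack/register variable
    for (std::uint64_t i = 0; i < iterations; ++i) {
        local += i;
    }
    tls_counter = local;  // single write-back at the end
    return local;
}

// Small self-check: both variants must produce the same result.
void check_equivalence() {
    assert(sum_direct(1000) == sum_cached(1000));
}
```

Whether this helps depends on the compiler, the optimization level and the TLS model in use. An optimizing compiler may already hoist the access out of the loop, while with a key-based API such as pthread_getspecific the call is usually made every time it appears. It is worth profiling again after the change rather than assuming a gain.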