Now that C++ is adding thread_local
storage as a language feature, I\'m wondering a few things:
thead_local
likely to
A description for how it works on Linux by Uli Drepper (maintainer of glibc) can be found here: www.akkadia.org/drepper/tls.pdf
The requirement to handle dynamically loaded modules etc. make the entire mechanism a bit convoluted, which perhaps partly explains why the document weights in at 79 pages (!).
Memory-usage-wise, each per-thread variable obviously needs it's own per-thread memory (although in some cases this can be done lazily such that the space is allocated only once the variable is first accessed), and then there's some extra datastructures that are needed for offset tables etc.
Performance-wise, the extra cost to access a TLS variable mostly revolves around retrieving the address of the variable. On x86 Linux the GS register is used as a start to get a thread id, on x86-64 FS. Usually there is a few pointer dereferences, and a function call (__tls_get_addr) for dynamically loaded code. There's also the cost that creating a new thread is slower because the implementation needs to allocate space and possibly initialize all the TLS vars (if not done lazily).
TLS is nice for easily making some old thread-unsafe code patterns thread-safe (think errno), but for new code designed from the start for a multi-threaded world it's very seldom needed.
Storage space: size of the variable * number of threads, or possibly (sizeof(var) + sizeof(var*)) * number of threads.
There are two basic ways of implementing thread-local storage:
Using some sort of system call that gets information about the current kernel thread. Sloooow.
Using some pointer, probably in a processor register, that is set properly at every thread context switch by the kernel - at the same time as all the other registers. Cheap.
On intel platforms, variant 2 is usually implemented via some segment register (FS or GS, I don't remember). Both GCC and MSVC support this. Access times are therefore about as fast as for global variables.
It is also possible, but I haven't seen it yet in practice, for this to be implemented via existing library functions like pthread_getspecific
. Performance would then be like 1. or 2., plus library call overhead. Keep in mind that variant 2. + library call overhead is still a lot faster than a kernel call.