Should I "bind" a "spinning" thread to a certain core?

难免孤独 2021-02-03 11:32

My application contains several latency-critical threads that "spin", i.e. never block. Each such thread is expected to take 100% of one CPU core. However it seems modern operation s

5 Answers
  •  长发绾君心
    2021-02-03 11:53

    Pinning a task to a specific processor will generally give better performance for that task. But there are a lot of nuances and costs to consider when doing so.

    When you force affinity, you restrict the operating system's scheduling choices and increase CPU contention for the remaining tasks, so everything else on the system is affected, including the operating system itself. You also need to consider that if tasks communicate through shared memory and their affinities are set to CPUs that don't share a cache, you can drastically increase the latency of communication between those tasks.
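As a minimal sketch of what forcing affinity looks like in practice, assuming Linux with glibc (the question does not name an OS), the helpers below pin the calling thread using `sched_setaffinity()`. The names `pin_current_thread` and `first_allowed_cpu` are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t, CPU_SET and sched_setaffinity in glibc */
#endif
#include <sched.h>

/* Pin the calling thread to one logical CPU (hypothetical helper).
 * Returns 0 on success, -1 on failure with errno set. */
static int pin_current_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return the lowest-numbered CPU the calling thread may run on,
 * or -1 if the affinity mask cannot be read (hypothetical helper). */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}
```

A spinning thread would call `pin_current_thread(n)` once at startup, before entering its loop. Checking the return value matters, because the call fails if core `n` is outside the set of CPUs the process is allowed to use.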

    One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system can move it to a CPU that doesn't have access to the previous CPU's cache or TLB. This can increase cache misses for the task. It is particularly an issue when communicating between tasks, because it takes more time to communicate through higher-level caches and, in the worst case, main memory. To measure cache statistics on Linux (and performance in general) I recommend using perf.
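Before reaching for cache counters, it can help to check whether the scheduler is moving the thread at all. The following is only a rough sketch, assuming Linux with glibc: it spins for a number of iterations and counts how often `sched_getcpu()` reports a different CPU. The function name `count_cpu_switches` is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes sched_getcpu in glibc */
#endif
#include <sched.h>

/* Spin for `iterations` passes and count how many times the CPU reported
 * by sched_getcpu() changes. Returns -1 if sched_getcpu() fails. */
static long count_cpu_switches(long iterations)
{
    int last = sched_getcpu();
    if (last < 0)
        return -1;

    long switches = 0;
    for (long i = 0; i < iterations; i++) {
        int now = sched_getcpu();
        if (now < 0)
            return -1;
        if (now != last) {
            switches++;
            last = now;
        }
    }
    return switches;
}
```

A result of zero over a long run suggests the thread is staying put; a pinned thread should always report zero. This says nothing about cache misses themselves, which still need a tool such as perf.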

    The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency is the rdtsc instruction (at least on x86). It reads the CPU's time stamp counter, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.

    #include <stdint.h>

    /* Read the x86 time stamp counter; RDTSC returns it in EDX:EAX. */
    static inline uint64_t rdtsc(void) {
       uint32_t eax, edx;
       asm volatile ("rdtsc" : "=a"(eax), "=d"(edx));
       return ((uint64_t) edx << 32) | (uint64_t) eax;
    }
    
    • note - the rdtsc instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp instead)
    • also note - if rdtsc is used without an invariant time source (on Linux, grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and when the task switches CPU (and therefore time source)
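The two notes above can be sketched roughly as follows, assuming GCC or Clang on x86 Linux. `rdtsc_fenced` issues a load fence before reading the counter, and `has_constant_tsc` does the equivalent of the grep on /proc/cpuinfo. Both function names are hypothetical; `_mm_lfence` and `__rdtsc` are the compiler intrinsics from `<x86intrin.h>`.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>   /* _mm_lfence, __rdtsc (GCC/Clang, x86 only) */

/* Read the TSC after a load fence, so that earlier instructions
 * have completed before the counter is sampled. */
static inline uint64_t rdtsc_fenced(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Returns 1 if a "flags" line in /proc/cpuinfo mentions constant_tsc,
 * 0 if none does, and -1 if the file cannot be opened. */
static int has_constant_tsc(void)
{
    static char line[16384];   /* flags lines can be long */
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return -1;

    int found = 0;
    while (!found && fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0 && strstr(line, " constant_tsc"))
            found = 1;
    }
    fclose(f);
    return found;
}
```

To time an event, take one `rdtsc_fenced()` reading before it and one after, then subtract. The difference is in TSC ticks, not nanoseconds, and is only trustworthy if `has_constant_tsc()` returns 1 and the thread stayed on one CPU between the two readings.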

    So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.

    Some additional reading...

    • Intel 64 Architecture Processor Topology Enumeration
    • What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
    • Intel Software Developer Reference (Vol. 2A/2B)
    • Acquire and Release Fences
    • TCMalloc
