My application contains several latency-critical threads that \"spin\", i.e. never blocks. Such thread expected to take 100% of one CPU core. However it seems modern operation s
Running a thread locked to a single core gives the best latency for that thread in most circumstances if this is the most important thing in your code.
The reasons(R) are
Unless
So if you need less than 100ns latency to keep your application from exploding you need to prevent or lessen the impact of SMT, interrupts and task switching on your core. The perfect solution would be an Real time operating system with static scheduling. This is a nearly perfect match for your target, but its a new world if your have mostly done server and desktop programming.
The disadvantages of locking a thread to a single core are:
Using pthread_setschedparam with SCHED_FIFO and highest priority running in SU and locking it to the core and its evil twin should secure the best latency of all of these, only a real time operating system can eliminate all context switches.
Discussion on interrupts.
Your Linux might accept that you call sched_setscheduler, using SCHED_FIFO, but this demands you got your own PID not just a TID or that your threads are cooperative multitasking.
This might not ideal as all your threads would only be switches "voluntarily" and thereby removing flexibility for the kernel to schedule it.
Interprocess communication in 100ns
I came across this question because I'm dealing with the exactly same design problem. I'm building HFT systems where each nanosecond count. After reading all the answers, I decided to implement and benchmark 4 different approaches
The imbatible winner was "busy wait with affinity set". No doubt about it.
Now, as many have pointed out, make sure to leave a couple of cores free in order to allow OS run freely.
My only concern at this point is if there is some physical harm to those cores that are running at 100% for hours.
Pinning a task to specific processor will generally give better performance for the task. But, there are a lot of nuances and costs to consider when doing so.
When you force affinity, you restrict the operating system's scheduling choices. You increase cpu contention for the remaining tasks. So EVERYTHING else on the system is impacted including the operating system itself. You also need to consider that if tasks need to communicate across memory, and affinities are set to cpus that don't share cache, you can drastically increase latency for communication across tasks.
One of the biggest reasons setting task cpu affinity is beneficial though, is that it gives more predictable cache and tlb (translation lookaside buffer) behavior. When a task switches cpus, the operating system can switch it to a cpu that doesn't have access to the last cpu's cache or tlb. This can increase cache misses for the task. It's particularly an issue communicating across tasks, as it takes more time to communicate across higher level caches and worst finally memory. To measure cache statistics on linux (performance in general) I recommend using perf.
The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency would be by using the rdtsc
instruction (at least on x86). This reads the cpu's time source, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.
volatile uint64_t rdtsc() {
register uint32_t eax, edx;
asm volatile (".byte 0x0f, 0x31" : "=d"(edx), "=a"(eax) : : );
return ((uint64_t) edx << 32) | (uint64_t) eax;
}
rdtsc
instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp
)rdtsc
is used without an invariant time source (on linux grep constant_tsc /proc/cpuinfo
, you may get unreliable values across frequency changes and if the task switches cpu (time source)So, in general, yes, setting the affinity does gives lower latency, but this is not always true, and there are very serious costs when you do it.
Some additional reading...
This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run it on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do that.
So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?
Binding a thread to a specific core is probably not the best way to get the job done. You can do that, it will not harm a multi core CPU.
The really best way to reduce latency is to raise the priority of the process and the polling thread(s). Normally the OS will interrupt your threads hundreds of times a second and let other threads run for a while. Your thread may not run for several milliseconds.
Raising the priority will reduce the effect (but not eliminate it).
Read more about SetThreadPriority and SetProcessPriorityBoost. There some details in the docs you need to understand.