Say I have the de facto standard x86 CPU with three levels of cache, L1/L2 private to each core, and L3 shared among cores. Is there a way to allocate shared memory whose data will not be cached in the cores' private caches?
I believe you should not (and probably cannot) care, and should simply hope that the shared memory ends up in L3. BTW, user-space C code runs in a virtual address space, and your other cores might (and often do) run some other, unrelated process.
The hardware and the MMU (which is configured by the kernel) will ensure that L3 is properly shared.
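For reference, here is a minimal sketch of how such shared memory is usually obtained in user-space (the name "/my_shm_region" is made up); notice that nothing in it lets you choose a cache level, since the hardware decides that:

```c
/* Sketch only: ordinary POSIX shared memory with a hypothetical name.
   Which cache level the data sits in is decided by the hardware, not here. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 4096;
    int fd = shm_open("/my_shm_region", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, len) < 0) { perror("ftruncate"); return 1; }
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(p, "hello from one process"); /* another process mapping the same
                                            name sees this data */
    munmap(p, len);
    close(fd);
    shm_unlink("/my_shm_region");
    return 0;
}
```

(On older glibc you may need to link with -lrt.)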
but I'd like to experiment with performance with and without bringing the shared data into private caches.
As far as I understand recent Intel hardware (quite poorly), this is not possible (at least not from user-land).
Maybe you might consider the PREFETCH machine instruction and the __builtin_prefetch GCC builtin, but these do the opposite of what you want: they bring data into closer caches.
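To illustrate, a minimal sketch of how __builtin_prefetch is typically used (the loop, the prefetch distance of 16 elements, and the locality hint are arbitrary choices):

```c
/* Sketch: __builtin_prefetch pulls data *towards* the core's private caches,
   i.e. the opposite of keeping it out of them. */
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```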
BTW, the kernel does preemptive scheduling, so context switches can happen at any moment (often several hundred times per second). When, at a context switch, another process is scheduled on the same core, the MMU has to be reconfigured (each process has its own virtual address space) and the caches become "cold" again.
You might be interested in processor affinity; see sched_setaffinity(2). Read about Real-Time Linux. See also sched(7) and numa(7).
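For example, a minimal sketch of pinning the calling process to a single core with sched_setaffinity(2) (core 2 is an arbitrary choice):

```c
/* Sketch: pin the calling process to one core (core 2 chosen arbitrarily),
   so its cache-sensitive code is not migrated between private L1/L2 caches. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);               /* run only on logical CPU 2 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... latency-sensitive work here ... */
    return 0;
}
```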
I am not at all sure that the performance hit you are afraid of is noticeable (and I believe it is not avoidable in user-space).
Perhaps you might consider moving your sensitive code into kernel space (so it runs with CPL0 privilege), but that probably requires months of work and is probably not worth the effort. I won't even try it.
Have you considered completely different approaches to your latency-sensitive code, e.g. rewriting it in OpenCL for your GPGPU?