OpenMP False Sharing
I believe I am experiencing false sharing with OpenMP. Is there any way to identify it and fix it? My code is at https://github.com/wchan/libNN/blob/master/ResilientBackpropagation.hpp (line 36). On a 4-core CPU, the OpenMP version yielded only 10% additional performance over the single-threaded, 1-core version. On a NUMA system with 32 physical (64 virtual) CPUs, CPU utilization is stuck at around 1.5 cores. I think this is a direct symptom of false sharing and of the code being unable to scale. I also tried running it under the Intel VTune profiler, which reported that most of the time is spent in the "f()" and "+=" functions. I