I have an OpenMP program (thousands of lines, impossible to reproduce here) that works as follows:
It consists of worker threads along with a task queue.
A task cons
I have two suggestions:
1.) On NUMA systems you want to make sure that each buffer you write to is aligned to a page boundary and that its size is a multiple of the page size. Pages are usually 4096 bytes. If a page ends up shared between buffers that different threads write to, you get false sharing at page granularity.
From http://dl.acm.org/citation.cfm?id=1295483:
"False sharing occurs when processors in a shared-memory parallel system make references to different data objects within the same coherence block (cache line or page), thereby inducing "unnecessary" coherence operations."
And from https://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf:
"...false sharing which occurs when several independent objects which may have different access patterns are allocated to same unit of movable memory (in our case, a page of virtual memory)."
So, for example, if an array is 5000 bytes you should pad it to 8192 bytes (2*4096). Then align it with something like
float* array = (float*)_mm_malloc(8192, 4096); // two pages, aligned to a page boundary; release with _mm_free(array)
On non-NUMA systems you don't want multiple threads to write to the same cache line (usually 64 bytes), because this causes false sharing. On NUMA systems you additionally don't want multiple threads writing to the same page (usually 4096 bytes).
See some of the comments on the question "Fill histograms (array reduction) in parallel with OpenMP without using a critical section".
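2.) The operating system can migrate OpenMP threads to different cores/processors, so you may want to bind the threads to specific cores/processors. You can do this with both ICC and GCC. With GCC I think you want to set something like GOMP_CPU_AFFINITY="0 2 4 ..." in the environment (the quotes are needed so the shell passes the whole list as one value).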
See also the question "What limits scaling in this simple OpenMP program?".