Why would parallelization decrease performance so dramatically?

小蘑菇 2021-02-15 01:38

I have an OpenMP program (thousands of lines, impossible to reproduce here) that works as follows:

It consists of worker threads along with a task queue.
A task cons

2 answers

被撕碎了的回忆 (OP)

2021-02-15 02:19
I have two suggestions:

1.) On NUMA systems, make sure that the buffers you write to are aligned to page boundaries and that their sizes are multiples of the page size. Pages are usually 4096 bytes. If a buffer only partly fills a page, so that the page is shared with other data, you get false sharing at page granularity.

http://dl.acm.org/citation.cfm?id=1295483

False sharing occurs when processors in a shared-memory parallel system make references to different data objects within the same coherence block (cache line or page), thereby inducing "unnecessary" coherence operations.

and this link https://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf

...false sharing which occurs when several independent objects which may have different access patterns are allocated to same unit of movable memory (in our case, a page of virtual memory).

So, for example, if an array is 5000 bytes you should round it up to 8192 bytes (2*4096). Then align it with something like
```
float* array = (float*)_mm_malloc(8192, 4096);  // two whole pages, starting on a page boundary; release with _mm_free(array)
```
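The rounding step above can also be written out explicitly. The following is only a minimal sketch, assuming a 4096-byte page and a C11 toolchain that provides `aligned_alloc`; the helper names `round_up_to_page` and `alloc_page_aligned` are made up for illustration and are not part of any library.

```c
#include <stdlib.h>

/* Assumed page size. On POSIX systems the real value can be queried
   with sysconf(_SC_PAGESIZE) instead of hard-coding it. */
#define PAGE_SIZE 4096u

/* Hypothetical helper: round a byte count up to the next multiple of PAGE_SIZE. */
static size_t round_up_to_page(size_t bytes) {
    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
}

/* Hypothetical helper: allocate a buffer that starts on a page boundary
   and spans whole pages. Release it with free(). */
static void *alloc_page_aligned(size_t bytes) {
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       which the rounding guarantees. */
    return aligned_alloc(PAGE_SIZE, round_up_to_page(bytes));
}
```

With these helpers, `alloc_page_aligned(5000)` would request 8192 bytes, matching the example in the text.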
On non-NUMA systems you don't want multiple threads writing to the same cache line (usually 64 bytes), since this causes false sharing. On NUMA systems you additionally don't want multiple threads writing to the same page (usually 4096 bytes).
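For the cache-line case, one common way to keep threads off each other's lines is to pad per-thread data to the line size. This is a sketch only, assuming a 64-byte cache line; `MAX_THREADS`, `padded_counter`, and `count_even` are hypothetical names chosen for the example, and the code falls back to a serial run if it is compiled without OpenMP.

```c
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback so the sketch still builds and runs serially without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define MAX_THREADS 8   /* hypothetical upper bound on the team size */

/* One counter per thread, padded so that each one fills a whole cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "each counter must fill exactly one cache line");

/* Count the even values in data[0..n) without threads sharing a line. */
static long count_even(const int *data, size_t n) {
    _Alignas(CACHE_LINE) struct padded_counter counts[MAX_THREADS] = {{0}};

    #pragma omp parallel num_threads(MAX_THREADS)
    {
        int tid = omp_get_thread_num();  /* always < MAX_THREADS here */
        #pragma omp for
        for (long i = 0; i < (long)n; i++) {
            if (data[i] % 2 == 0) {
                counts[tid].value++;     /* each thread writes only its own line */
            }
        }
    }

    /* Combine the per-thread results serially. */
    long total = 0;
    for (int t = 0; t < MAX_THREADS; t++) {
        total += counts[t].value;
    }
    return total;
}
```

For a simple sum like this, an OpenMP `reduction(+:...)` clause would be shorter; the padded array is shown because it carries over to per-thread buffers that a reduction clause cannot express.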

See some of the comments on the question "Fill histograms (array reduction) in parallel with OpenMP without using a critical section".

2.) The runtime is free to migrate OpenMP threads between cores/processors, so you may want to bind the threads to specific cores/processors. You can do this with both ICC and GCC. With GCC I think you want to set something like `GOMP_CPU_AFFINITY="0 2 4 ..."`. See the question "What limits scaling in this simple OpenMP program?".
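Besides the GCC-specific `GOMP_CPU_AFFINITY`, OpenMP 4.0 and later define the portable environment variables `OMP_PROC_BIND` and `OMP_PLACES`. The sketch below only reports which binding policy the runtime picked up, as a sanity check that the setting took effect; the function name `binding_policy_name` is made up for this example, and it assumes a compiler with OpenMP 4.0 support (otherwise it reports that binding information is unavailable).

```c
#if defined(_OPENMP) && _OPENMP >= 201307
#include <omp.h>
#endif

/* Hypothetical helper: return a human-readable name for the thread
   binding policy currently in effect. */
static const char *binding_policy_name(void) {
#if defined(_OPENMP) && _OPENMP >= 201307
    switch (omp_get_proc_bind()) {
        case omp_proc_bind_false:  return "false (threads may migrate)";
        case omp_proc_bind_true:   return "true";
        case omp_proc_bind_close:  return "close";
        case omp_proc_bind_spread: return "spread";
        default:                   return "master/primary";
    }
#else
    return "unavailable (built without OpenMP 4.0 support)";
#endif
}
```

One way to try it, assuming the program prints the returned string, is to run it as `OMP_PROC_BIND=close OMP_PLACES=cores ./a.out` and compare against a run with `OMP_PROC_BIND=false`.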