Question:
I would like to take advantage of OpenMP to parallelize my task.
I need to subtract the same quantity from all the elements of an array and write the result into another vector. Both arrays are dynamically allocated with malloc, and the first one is filled with values from a file. Each element is of type uint64_t.
#pragma omp parallel for
for (uint64_t i = 0; i < size; ++i) {
new_vec[i] = vec[i] - shift;
}
Here shift is the fixed value I want to subtract from every element of vec, and size is the length of both vec and new_vec, which is approximately 200k.
I compile the code with g++ -fopenmp
on Arch Linux. I'm on an Intel Core i7-6700HQ, and I use 8 threads. The running time is 5 to 6 times higher when I use the OpenMP version. I can see that all the cores are working when I run the OpenMP version.
I think this might be caused by a False Sharing issue, but I can't find it.
Answer 1:
You should adjust how the iterations are split among the threads. With schedule(static, chunk_size) you can control exactly that.
Use chunk_size values that are multiples of 64/sizeof(uint64_t) (i.e. 8 elements per 64-byte cache line) to avoid the false sharing you suspect. Otherwise chunk boundaries can fall in the middle of a cache line:
[ cache line n  ][ cache line n+1 ]
[ chunk 0 ][ chunk 1 ][ chunk 2 ]
And achieve something like this:
[ cache line n ][ cache line n+1 ][ cache line n+2 ][...]
[ chunk 0 ][ chunk 1 ]
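As a sketch of the advice above (reusing the question's variable names; the function wrapper is only for illustration), pinning the chunk size to a whole number of cache lines looks like this:

```cpp
#include <cassert>
#include <cstdint>

// Each static chunk covers a whole number of 64-byte cache lines
// (64 / sizeof(uint64_t) = 8 elements), so no two threads ever write
// into the same cache line of new_vec, provided new_vec is 64-byte aligned.
void shift_all(const uint64_t *vec, uint64_t *new_vec,
               uint64_t size, uint64_t shift) {
    const uint64_t chunk = 64 / sizeof(uint64_t);  // 8 elements per line
    #pragma omp parallel for schedule(static, chunk)
    for (uint64_t i = 0; i < size; ++i) {
        new_vec[i] = vec[i] - shift;
    }
}
```

Without -fopenmp the pragma is simply ignored and the loop runs serially, so the function stays correct either way.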
You should also allocate your vectors so that they start on a cache-line boundary. That way the first chunk, and therefore every subsequent chunk, is properly aligned as well.
#include <unistd.h>  // sysconf
#include <cstdlib>   // aligned_alloc (C++17)
#define CACHE_LINE_SIZE sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
uint64_t *vec = (uint64_t *) aligned_alloc(CACHE_LINE_SIZE /*alignment*/, 200000 * sizeof(uint64_t) /*size*/);
Your problem is really similar to what the Stream Triad benchmark represents. Check out how that benchmark is optimized and you will be able to map those optimizations almost exactly onto your code.
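For reference, the Triad kernel of the STREAM benchmark has the same memory-bound shape as the question's loop: a couple of streaming loads and one streaming store per iteration (the function below uses integer types to match the question; STREAM itself uses double):

```cpp
#include <cassert>
#include <cstdint>

// STREAM-style "triad" kernel: a[i] = b[i] + scalar * c[i].
// Like the subtract-and-store loop in the question, it is bound by
// memory bandwidth, not arithmetic, so the same scheduling and
// alignment tuning applies to both.
void triad(uint64_t *a, const uint64_t *b, const uint64_t *c,
           uint64_t scalar, uint64_t n) {
    #pragma omp parallel for schedule(static)
    for (uint64_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
}
```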
Source: https://stackoverflow.com/questions/45032586/false-sharing-in-openmp-loop-array-access