I am a newbie in programming with OpenMP. I wrote a simple C program to multiply a matrix with a vector. Unfortunately, by comparing execution time I found that the OpenMP is much
Your code partially suffers from so-called false sharing, which is typical of all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When thread i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores, and they have to refresh their copy from the upper-level cache or from main memory. Since result is an array of long long, one cache line (64 bytes on x86) holds 8 elements, so besides result[i] there are 7 other array elements in the same cache line. It is therefore possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).
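The arithmetic above can be sketched in a few lines of C. This is only an illustration: the 64-byte line size is hard-coded as an assumption (it is not queried from the hardware), the helper names elems_per_line and line_of are made up for this example, and line_of assumes that result[] starts on a cache line boundary.

```c
#include <stddef.h>

/* Assumed cache line size in bytes (typical for x86, not detected at run time). */
#define CACHE_LINE_BYTES 64

/* Number of long long elements that fit in one cache line. */
size_t elems_per_line(void)
{
    return CACHE_LINE_BYTES / sizeof(long long);
}

/* Index of the cache line that holds result[i], assuming result[]
 * itself starts on a cache line boundary. */
size_t line_of(size_t i)
{
    return i / elems_per_line();
}
```

With an 8-byte long long, result[0] through result[7] map to the same line, so threads writing to any two of them compete for that line, while result[8] starts the next one.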
To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets an iteration block whose size is divisible by the number of elements in a cache line. For example, you can apply the schedule(static,something*8) clause, where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread gets a block. E.g. for m_size equal to 999 and 4 threads you would apply the schedule(static,256) clause to the parallel for construct.
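A minimal sketch of how this could look, assuming a square matrix stored row by row in a flat array. The question's actual code is not shown here, so the names matvec, matrix, vector and chunk_for are hypothetical; only result and the chunk size of 256 for 999 rows and 4 threads come from the text above. The chunk size only lines up with cache lines if result[] itself starts on a cache line boundary.

```c
#include <stddef.h>

/* Assumed number of long long elements per 64-byte cache line. */
#define ELEMS_PER_LINE 8

/* Hypothetical helper: rows per thread, rounded up to a multiple of
 * ELEMS_PER_LINE so that each chunk covers whole cache lines of result[]. */
int chunk_for(int n, int nthreads)
{
    int per_thread = (n + nthreads - 1) / nthreads;   /* ceil(n / nthreads) */
    return (per_thread + ELEMS_PER_LINE - 1) / ELEMS_PER_LINE * ELEMS_PER_LINE;
}

/* result = matrix * vector for an n x n row-major matrix. */
void matvec(int n, const long long *matrix, const long long *vector,
            long long *result, int chunk)
{
    /* Each thread receives blocks of `chunk` consecutive rows. */
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < n; i++) {
        result[i] = 0;
        for (int j = 0; j < n; j++)
            result[i] += matrix[(size_t)i * n + j] * vector[j];
    }
}
```

For n equal to 999 and 4 threads, chunk_for returns 256, so the call would be matvec(999, matrix, vector, result, chunk_for(999, 4)). Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.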
Another partial reason for the code to run slower might be that when OpenMP is enabled, the compiler might become reluctant to apply some code optimisations when shared variables are being assigned to. OpenMP provides a so-called relaxed memory model, in which the local memory view of a shared variable is allowed to differ between threads, and the flush construct is provided in order to synchronise the views. But compilers usually treat shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and the value of result[i] is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result from the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporary desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary variable trick with both OpenMP enabled and not.
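If your compiler does not do this on its own, the temporary can be written out by hand. The following is a sketch under the same assumptions as before (a flat row-major n x n matrix; matvec_tmp, matrix, vector and sum are hypothetical names), not a reconstruction of your code:

```c
#include <stddef.h>

/* result = matrix * vector, accumulating each row in a local temporary. */
void matvec_tmp(int n, const long long *matrix, const long long *vector,
                long long *result)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        long long sum = 0;   /* declared inside the loop body, so private to each thread */
        for (int j = 0; j < n; j++)
            sum += matrix[(size_t)i * n + j] * vector[j];
        result[i] = sum;     /* single write to the shared array per row */
    }
}
```

Because sum is private, the inner loop no longer touches result[] at all, which also reduces the number of writes that can trigger false sharing to one per row.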