You have a typical case of cache conflicts, better known as false sharing.
Consider that a cache line on your CPU is probably 64 bytes long. Having one processor/core write the first 4 bytes (one float) causes that cache line to be invalidated in every other core's L1/L2 and maybe the L3. This is a lot of overhead.
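Here is a toy sketch of the effect (the names and iteration count are made up): four threads each bump their own float, but all four floats sit in one 64-byte line, so the line ping-pongs between the cores on every write:

#include <stdio.h>
#include <omp.h>

/* Four 4-byte counters packed together: they share a single 64-byte
   cache line. volatile stops the compiler from hoisting the counter
   into a register, so every iteration really stores to memory. */
volatile float counters[4];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(4)
    {
        int me = omp_get_thread_num();
        for (int n = 0; n < 10000000; n++)
            counters[me] += 1.0f;   /* invalidates the line in the other cores */
    }
    printf("%.3f s\n", omp_get_wtime() - t0);
    return 0;
}

Give each counter its own 64-byte line (padding, or simply a local variable per thread) and the slowdown disappears; the scheduling fix below does the same for your results array.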
Partition your data better!
#pragma omp parallel for private(i) shared(results, vector, matrix) schedule(static,16)
should do the trick: with a chunk size of 16, each thread writes 16 consecutive 4-byte floats, i.e. 64 bytes, so each chunk of results covers exactly one cache line and no two threads write to the same line. Increase the chunk size if this does not help.
Another optimisation is to accumulate the result in a local variable before you flush it out to memory.
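A sketch of the difference, guessing at the shape of your original inner loop (same variable names as your code):

/* before: every += stores into the shared results array */
for (i = 0; i < matrix_size; i++)
    results[y] += vector[i] * matrix[i][y];

/* after: accumulate in a local that can live in a register, store once */
double result = 0;
for (i = 0; i < matrix_size; i++)
    result += vector[i] * matrix[i][y];
results[y] = result;

The full version below does exactly this.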
Also, this is an OpenMP thing, but you don't need to start a new parallel region for the loop (each mention of parallel starts a new team):
#pragma omp parallel default(none) \
        shared(results, vector, matrix) \
        firstprivate(matrix_size) \
        num_threads(4)
{
    int i, y;

    #pragma omp for schedule(static, 16)
    for (y = 0; y < matrix_size; y++) {
        double result = 0;                  /* thread-local accumulator */
        for (i = 0; i < matrix_size; i++) {
            result += vector[i] * matrix[i][y];
        }
        results[y] = result;                /* one store into the shared array */
    }
}
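For completeness, here is a self-contained version you can compile with gcc -fopenmp -O2 and time. It is a sketch: the float element type, the square matrix and matrix_size = 2048 are my assumptions, so adjust them to your actual code.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    int matrix_size = 2048;                 /* made-up size */
    float *vector  = malloc(matrix_size * sizeof *vector);
    float *results = malloc(matrix_size * sizeof *results);
    float (*matrix)[matrix_size] = malloc(sizeof(float[matrix_size][matrix_size]));

    for (int i = 0; i < matrix_size; i++) {
        vector[i] = 1.0f;
        for (int y = 0; y < matrix_size; y++)
            matrix[i][y] = 1.0f;
    }

    double t0 = omp_get_wtime();
    #pragma omp parallel default(none) \
            shared(results, vector, matrix) \
            firstprivate(matrix_size) \
            num_threads(4)
    {
        #pragma omp for schedule(static, 16)
        for (int y = 0; y < matrix_size; y++) {
            double result = 0;
            for (int i = 0; i < matrix_size; i++)
                result += vector[i] * matrix[i][y];
            results[y] = result;
        }
    }
    printf("%.3f s, results[0] = %f\n", omp_get_wtime() - t0, results[0]);

    free(vector); free(results); free(matrix);
    return 0;
}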