I need to implement a reduction operation (each thread should store its value in a different array entry). However, it runs slower as I add more threads. Any suggestions?
Have you tried OpenMP's reduction clause?
double global_sum = 0.0;
#pragma omp parallel for shared(h,n,a) reduction(+:global_sum)
for (i = 1; i < n; i++) {
    global_sum += f(a + i * h);
}
However, there can be many other reasons why it runs slowly. For example, you should not create 16 threads if you have only 2 CPU cores.
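To make both points concrete, here is a minimal sketch that wraps the loop above in a function and caps the team size at the number of processors OpenMP reports. The integrand f(x) = x*x and the names interior_sum and worker_count are assumptions for illustration only; your f, a, h and n will differ.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical integrand; the original post does not show f. */
double f(double x) { return x * x; }

/* Hypothetical helper: number of threads to request.
   Falls back to 1 when the file is compiled without OpenMP. */
int worker_count(void) {
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

/* Sum f(a + i*h) for i = 1 .. n-1.
   Each thread accumulates into its own private copy of global_sum;
   OpenMP combines the copies once when the loop finishes. */
double interior_sum(double a, double h, int n) {
    assert(n > 0); /* precondition: at least one interval */
    double global_sum = 0.0;
    #pragma omp parallel for num_threads(worker_count()) reduction(+:global_sum)
    for (int i = 1; i < n; i++) {
        global_sum += f(a + i * h);
    }
    return global_sum;
}
```

With GCC this would be built with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially. Note that a floating-point reduction may combine partial sums in a different order from run to run, so the last digits of the result can vary slightly for a general f.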