Question
Please look at this code.
Single-threaded program: http://pastebin.com/KAx4RmSJ. Compiled with:
g++ -lrt -O2 main.cpp -o nnlv2
Multi-threaded version with OpenMP: http://pastebin.com/fbe4gZSn. Compiled with:
g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp
I tested it on a dual-core system (so two threads run in parallel). But the multi-threaded version is slower than the single-threaded one, and its timings are unstable (try running it a few times). What's wrong? Where did I make a mistake?
Some tests:
Single-thread:
Layers Neurons Inputs --- Time (ns)
10 200 200 --- 1898983
10 500 500 --- 11009094
10 1000 1000 --- 48116913
Multi-thread:
Layers Neurons Inputs --- Time (ns)
10 200 200 --- 2518262
10 500 500 --- 13861504
10 1000 1000 --- 53446849
I don't understand what is wrong.
Answer 1:
Is your goal here to study OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write multiply-add code, reduce the number of passes, and incorporate SIMD.
Step 1: Combine loops and use multiply-add:
// remove the variable 'temp' completely
for (int i = 0; i < LAYERS; i++)
{
    for (int j = 0; j < NEURONS; j++)
    {
        outputs[j] = 0;
        for (int k = 0, l = 0; l < INPUTS; l++, k++)
        {
            outputs[j] += inputs[l] * weights[i][k];
        }
        outputs[j] = sigmoid(outputs[j]);
    }
    std::swap(inputs, outputs);
}
Answer 2:
Compiling with -static and -p, running the program, and then parsing gmon.out with gprof, I got:
45.65% gomp_barrier_wait_end
That's a lot of time in OpenMP's barrier routine, i.e. time spent waiting for the other threads to finish. Since you run the parallel for loops many times (once per layer, so LAYERS times), you lose much of the advantage of running in parallel: every parallel for loop ends with an implicit barrier that does not return until all other threads have finished.
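As a rough illustration of the point above, here is a minimal sketch that opens one parallel region around the whole layer loop instead of starting a new parallel for per layer. The function name forward, the flat weight layout, the sigmoid definition and the equal number of inputs and neurons are all assumptions made for this example; the original pastebin code is not reproduced here. Because each layer depends on the previous one, one barrier per layer remains, so this only removes the repeated start-up of the thread team and may or may not help in practice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed activation function; the question's own sigmoid is not shown.
inline double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Assumed layout: weights[layer][j * n + l] is the weight from input l to
// neuron j, and every layer has n inputs and n neurons.
std::vector<double> forward(const std::vector<std::vector<double>>& weights,
                            std::vector<double> inputs)
{
    const int n = static_cast<int>(inputs.size());
    const int layers = static_cast<int>(weights.size());
    std::vector<double> scratch(inputs.size());
    for (const auto& w : weights)
        assert(w.size() == inputs.size() * inputs.size());

    #pragma omp parallel
    {
        // Each thread keeps its own copy of the two buffer pointers.
        double* in = inputs.data();
        double* out = scratch.data();
        for (int layer = 0; layer < layers; ++layer) {
            const double* w = weights[layer].data();
            #pragma omp for schedule(static)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int l = 0; l < n; ++l)
                    sum += in[l] * w[j * n + l];
                out[j] = sigmoid(sum);
            }
            // The implicit barrier that ends "omp for" guarantees the whole
            // layer is written before any thread reads it as input.
            std::swap(in, out);
        }
    }
    // After an odd number of layers the result sits in scratch.
    return (layers % 2 == 0) ? inputs : scratch;
}
```

Built without -fopenmp the pragmas are ignored and the same code runs serially, which makes it easy to compare both timings from one source file.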
Answer 3:
Before anything else, run the multi-threaded test and MAKE SURE that procexp (Process Explorer) or Task Manager shows 100% CPU usage for it. If it doesn't, then you are not using multiple threads or multiple processor cores.
Also, taken from wiki:
Environment variables
A method to alter the execution features of OpenMP applications. Used to control loop iterations scheduling, default number of threads, etc. For example OMP_NUM_THREADS is used to specify number of threads for an application.
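Besides watching CPU usage, the thread count can be checked from inside the program. The sketch below is only an illustration: the helper names max_openmp_threads and report_threads are made up for this example, while omp_get_max_threads and the _OPENMP macro are part of standard OpenMP.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads a parallel region may use, or 1 when the
// program was built without -fopenmp (hypothetical helper for this example).
int max_openmp_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Prints the value so it can be compared with the number of cores.
void report_threads()
{
    std::printf("OpenMP may use up to %d thread(s)\n", max_openmp_threads());
}
```

Calling report_threads() at the start of main and launching the OpenMP build with OMP_NUM_THREADS=2 set in the environment should then report 2 threads, assuming nothing in the program overrides the setting.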
Answer 4:
I don't see where you have actually used OpenMP - try #pragma omp parallel for above the main loop... (documented here, for example)
The slowness is possibly from including OpenMP and it initialising, adding code bloat or otherwise changing the compilation as a result of the compiler flags you introduced to enable it. Alternatively the loops are so small and simple that the overhead of threading far exceeds the performance gain.
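If threading overhead is the suspect, OpenMP's if clause allows a loop to stay serial when there is too little work. The following is a sketch under stated assumptions: the function name weighted_sums, the flat weight layout and the threshold kMinWorkForThreads are all hypothetical, and a useful threshold would have to be found by measurement.

```cpp
#include <cassert>
#include <vector>

// Hypothetical cut-off (multiply-adds per call); must be tuned by measuring.
constexpr long kMinWorkForThreads = 100000;

// Computes sums[j] = sum over l of inputs[l] * weights[j * n + l].
// Assumed layout: one row of n weights per neuron, stored contiguously.
void weighted_sums(const std::vector<double>& weights,
                   const std::vector<double>& inputs,
                   std::vector<double>& sums)
{
    const long neurons = static_cast<long>(sums.size());
    const long n = static_cast<long>(inputs.size());
    assert(static_cast<long>(weights.size()) == neurons * n);

    // With a false condition the loop is executed by a single thread.
    #pragma omp parallel for if(neurons * n >= kMinWorkForThreads)
    for (long j = 0; j < neurons; ++j) {
        double sum = 0.0;
        for (long l = 0; l < n; ++l)
            sum += inputs[l] * weights[j * n + l];
        sums[j] = sum;
    }
}
```

Timing the same call with the threshold set very low and very high gives a quick way to see at which problem size threads start to pay off on a given machine.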
Source: https://stackoverflow.com/questions/6671448/why-is-this-openmp-program-slower-than-single-thread