Question
I basically have two vectors: one with a large number of elements, and a second with a small number of probes used to sample data from the elements. I stumbled upon the question of which order to implement the two loops in. Naturally I thought that having the outer loop over the larger vector would be beneficial.
Implementation 1:
for(auto& elem: elements) {
    for(auto& probe: probes) {
        probe.insertParticleData(elem);
    }
}
However, it seems that the second implementation takes only half the time.
Implementation 2:
for(auto& probe: probes) {
    for(auto& elem: elements) {
        probe.insertParticleData(elem);
    }
}
What could be the reason for that?
Edit:
Timings were generated by the following code:
clock_t t_begin_ps = std::clock();
... // timed code
clock_t t_end_ps = std::clock();
double elapsed_secs_ps = double(t_end_ps - t_begin_ps) / CLOCKS_PER_SEC;
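(Side note: std::clock measures CPU time rather than wall-clock time. If wall-clock timing were wanted instead, a minimal equivalent sketch using std::chrono could look like the following; the variable names simply mirror the snippet above.)

#include <chrono>

auto t_begin_ps = std::chrono::steady_clock::now();
... // timed code
auto t_end_ps = std::chrono::steady_clock::now();
double elapsed_secs_ps =
    std::chrono::duration<double>(t_end_ps - t_begin_ps).count();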
On inserting an element's data I basically do two things: test whether the distance to the probe is below a limit, and then compute an average.
bool probe::insertParticleData(const elem& pP) {
    if (!isInside(pP.position())) { return false; }
    ... // compute alpha and beta
    avg_vel = alpha*avg_vel + beta*pP.getVel();
    return true;
}
To get an idea of the memory usage: I have approx. 10k elements, which are objects with 30 double data members. For the test I used 10 probes containing 15 doubles each.
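A quick back-of-the-envelope check of those numbers (assuming 8-byte doubles and ignoring padding) puts the element data at roughly 2.4 MB and all probe data combined at roughly 1.2 KB:

#include <cstdio>

// Back-of-the-envelope footprint, assuming 8-byte doubles and no padding.
int main() {
    std::printf("elements: %zu bytes\n", 10000u * 30u * sizeof(double)); // ~2.4 MB
    std::printf("probes:   %zu bytes\n",    10u * 15u * sizeof(double)); // ~1.2 KB
}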
Answer 1:
Today's CPUs are heavily optimized for linear access to memory, so a few long loops will beat many short loops. You want the inner loop to iterate over the long vector.
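This claim can be checked with a small self-contained benchmark. The types below are hypothetical stand-ins sized like the question's (30 doubles per element, 15 per probe); they are not the original classes, and the absolute numbers will depend on compiler and hardware:

#include <array>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins with the member counts stated in the question:
// 30 doubles per element, 15 doubles per probe.
struct Elem {
    std::array<double, 3> pos{};
    std::array<double, 3> vel{};
    std::array<double, 24> payload{};
};

struct Probe {
    std::array<double, 3> center{};
    double radius2 = 1.0;
    std::array<double, 3> avg_vel{};
    std::array<double, 8> payload{};

    bool insertParticleData(const Elem& e) {
        double d2 = 0.0;
        for (int i = 0; i < 3; ++i) {
            const double d = e.pos[i] - center[i];
            d2 += d * d;
        }
        if (d2 > radius2) return false;          // distance test
        for (int i = 0; i < 3; ++i)              // running average of the velocity
            avg_vel[i] = 0.9 * avg_vel[i] + 0.1 * e.vel[i];
        return true;
    }
};

template <class F>
double seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::vector<Elem> elements(10000);
    std::vector<Probe> probes(10);

    // Implementation 1: short probe vector in the inner loop.
    const double t1 = seconds([&] {
        for (auto& elem : elements)
            for (auto& probe : probes)
                probe.insertParticleData(elem);
    });

    // Implementation 2: long element vector in the inner loop.
    const double t2 = seconds([&] {
        for (auto& probe : probes)
            for (auto& elem : elements)
                probe.insertParticleData(elem);
    });

    // Print a result derived from the probes so the work cannot be optimized away.
    std::printf("impl 1: %.6f s, impl 2: %.6f s, check: %f\n",
                t1, t2, probes[0].avg_vel[0]);
}

The lambda-based timer keeps the two measurements identical apart from the loop order, so any difference comes from the ordering itself.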
Answer 2:
My guess: if insertParticleData is virtual, then in the second version the compiler can treat the function's address as a constant within the inner loop and move the vtable fetch outside of it, i.e. it effectively generates code which looks like:
for (auto& probe: probes) {
    funcPtr p = probe.insertParticleData;
    for (auto& elem: elements) {
        (*p)(elem);
    }
}
whereas in the first version, p would be computed for every inner iteration.
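One way to test this guess empirically (the types below are hypothetical, not the question's classes): if the probes are normally used through a base class with a virtual insertParticleData, declare the override (or the class) final and re-time both loop orders. Compilers can then devirtualize the call, and if the gap between the two orderings disappears, the hoisted-vtable explanation is plausible.

#include <vector>

// Hypothetical types, only meant to illustrate the test; not the question's classes.
struct Elem { double x = 0.0; };

struct ProbeBase {
    virtual ~ProbeBase() = default;
    virtual bool insertParticleData(const Elem&) { return true; }
};

// 'final' tells the compiler no further override exists, so calls made through a
// ProbeFinal reference can be devirtualized and the per-call vtable load removed.
struct ProbeFinal final : ProbeBase {
    bool insertParticleData(const Elem&) override { return true; }
};

int main() {
    std::vector<Elem>       elements(10000);
    std::vector<ProbeFinal> probes(10);

    // Re-time both loop orders with the devirtualizable type; if the difference
    // between the orderings vanishes, the hoisted-vtable guess is plausible.
    for (auto& elem : elements)
        for (auto& probe : probes)
            probe.insertParticleData(elem);
}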
Source: https://stackoverflow.com/questions/27143919/c-nested-loop-performance