Since you (apparently) care about doing this fast, you might also consider trying to multi-thread the computation to take advantage of all available cores. I did a pretty trivial rewrite of your naive loop to use OpenMP, giving this:
timer.restart();
sum = 0;
// only real change is adding the following line:
#pragma omp parallel for schedule(dynamic, 4096) reduction(+:sum)
for (int i = 0; i < num_samples; i++) {
    sum += samples[i];
}
result = timer.elapsed();
std::cout << "OMP:\t\t" << result << ", sum = " << sum << std::endl;
Just for grins, I also rewrote your unrolled loop a little to allow semi-arbitrary unrolling, and added OpenMP as well:
static const int unroll = 32;

real total = real();
timer.restart();

// Note: this assumes num_samples is a multiple of unroll; the fixed-length
// inner loop gives the compiler something easy to unroll and/or vectorize.
#pragma omp parallel for reduction(+:total) schedule(dynamic, 4096)
for (int i = 0; i < num_samples; i += unroll) {
    for (int j = 0; j < unroll; j++)
        total += samples[i + j];
}
result = timer.elapsed();
std::cout << "ILP+OMP:\t" << result << ", sum = " << total << std::endl;
I also increased the array size (substantially) to get somewhat more meaningful numbers. The results were as follows. First for a dual-core AMD:
rewrite of 4096 Mb takes 8269023193
naive: 3336194526, sum = 536870912
pointers: 3348790101, sum = 536870912
algorithm: 3293786903, sum = 536870912
ILP: 2713824079, sum = 536870912
OMP: 1885895124, sum = 536870912
ILP+OMP: 1618134382, sum = 536870912
Then for a quad-core (Intel i7):
rewrite of 4096 Mb takes 2415836465
naive: 1382962075, sum = 536870912
pointers: 1675826109, sum = 536870912
algorithm: 1748990122, sum = 536870912
ILP: 751649497, sum = 536870912
OMP: 575595251, sum = 536870912
ILP+OMP: 450832023, sum = 536870912
From the looks of things, the OpenMP versions are probably hitting limits on memory bandwidth: they keep the CPU busier than the un-threaded versions, but still only reach around 70% utilization or so, indicating that something other than the CPU is acting as the bottleneck.
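One quick way to check that hypothesis (my addition, in the same style as the snippets above) is to time a plain copy of the same array: a copy is purely memory-bound, so if it isn't much faster than the parallel sum, it's the memory system rather than the ALUs you're waiting on. Keep in mind a copy also writes the data back, so it moves roughly twice as many bytes as a read-only sum:

// Rough memory-bandwidth check, assuming samples/num_samples/timer as in the
// question's code; dest is just a scratch buffer of the same size.
std::vector<real> dest(num_samples);

timer.restart();
#pragma omp parallel for schedule(dynamic, 4096)
for (int i = 0; i < num_samples; i++)
    dest[i] = samples[i];
result = timer.elapsed();

std::cout << "copy:\t\t" << result << std::endl;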