I\'m trying to write a branchless function to return the MAX or MIN of two integers without resorting to if (or ?:). Using the usual technique I can do this easily enough for a
To achieve your goals, you're best off just writing this:
template T max(T a, T b) { return (a > b) ? a : b; }
I implemented both the "naive" implementation of max()
as well as your branchless implementation. Both of them were not templated, and I instead used int32 just to keep things simple, and as far as I can tell, not only did Visual Studio 2017 make the naive implementation branchless, it also produced fewer instructions.
Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.
Admittedly, my assembly-fu isn't all that great, so while NaiveMax()
had 5 fewer instructions and no apparent branching (and inlining I'm honestly not sure what's happening) I wanted to run a test case to definitively show whether the naive implementation was faster or not.
So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.
#include
#include
using int32 = long;
using uint32 = unsigned long;
constexpr int32 NaiveMax(int32 a, int32 b)
{
return (a > b) ? a : b;
}
constexpr int32 FastMax(int32 a, int32 b)
{
int32 mask = a - b;
mask = mask >> ((sizeof(int32) * 8) - 1);
return a + ((b - a) & mask);
}
int main()
{
int32 resInts[1000] = {};
int32 lotsOfInts[1'000];
for (uint32 i = 0; i < 1000; i++)
{
lotsOfInts[i] = rand();
}
auto naiveTime = [&]() -> auto
{
auto start = std::chrono::high_resolution_clock::now();
for (uint32 i = 1; i < 1'000'000; i++)
{
const auto index = i % 1000;
const auto lastIndex = (i - 1) % 1000;
resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
}
auto finish = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast(finish - start).count();
}();
auto fastTime = [&]() -> auto
{
auto start = std::chrono::high_resolution_clock::now();
for (uint32 i = 1; i < 1'000'000; i++)
{
const auto index = i % 1000;
const auto lastIndex = (i - 1) % 1000;
resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
}
auto finish = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast(finish - start).count();
}();
std::cout << "Naive Time: " << naiveTime << std::endl;
std::cout << "Fast Time: " << fastTime << std::endl;
getchar();
return 0;
}
And here's the output I get on my machine:
Naive Time: 2330174
Fast Time: 2492246
I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.
Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.
In brief, it would seem that the best way to write a branchless templated max()
function is probably to keep it simple:
template T max(T a, T b) { return (a > b) ? a : b; }
There are additional upsides to the naive method: