Question
I am currently writing some GLSL-like vector math classes in C++, and I just implemented an abs() function like this:
template<class T>
static inline T abs(T _a)
{
    return _a < 0 ? -_a : _a;
}
I compared its speed to the default C++ abs from math.h, like this:
clock_t begin = clock();
for (int i = 0; i < 10000000; ++i)
{
    float a = abs(-1.25);
}
clock_t end = clock();
unsigned long time1 = (unsigned long)((float)(end - begin) / ((float)CLOCKS_PER_SEC / 1000.0));

begin = clock();
for (int i = 0; i < 10000000; ++i)
{
    float a = myMath::abs(-1.25);
}
end = clock();
unsigned long time2 = (unsigned long)((float)(end - begin) / ((float)CLOCKS_PER_SEC / 1000.0));

std::cout << time1 << std::endl;
std::cout << time2 << std::endl;
Now the default abs takes about 25 ms while mine takes 60. I guess there is some low-level optimisation going on. Does anybody know how the math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!
Answer 1:
Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.
Likely (as in almost not even a question), your double is in the binary64 format. This means the sign has its own bit, and taking the absolute value is merely a matter of clearing that bit. For example, as a specialization, a compiler implementer may do the following:
#include <cstdint>

template <>
double abs<double>(const double x)
{
    // Breaks strict aliasing, but the compiler writer knows this behavior for the platform.
    std::uint64_t i = reinterpret_cast<const std::uint64_t&>(x);
    i &= 0x7FFFFFFFFFFFFFFFULL; // clear the sign bit
    return reinterpret_cast<const double&>(i);
}
This removes branching and may run faster.
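For comparison (not part of the original answer), the same bit-clearing idea can be expressed without the aliasing violation by copying the bytes through std::memcpy (or std::bit_cast in C++20). A minimal sketch, assuming double is the 64-bit IEEE-754 binary64 format; abs_bits is just an illustrative name:

#include <cstdint>
#include <cstring>

// Sketch: clear the sign bit of a binary64 double without breaking strict aliasing.
inline double abs_bits(double x)
{
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);     // copy the bit pattern out
    i &= 0x7FFFFFFFFFFFFFFFULL;        // clear the sign bit
    std::memcpy(&x, &i, sizeof x);     // copy it back
    return x;
}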
Answer 2:
There are well-known tricks for computing the absolute value of a two's complement signed number. If the number is negative, flip all the bits and add 1; equivalently, xor with -1 and subtract -1. If it is positive, do nothing; equivalently, xor with 0 and subtract 0.
int my_abs(int x)
{
    int s = x >> 31;    // arithmetic shift: s is -1 if x is negative, 0 otherwise (assumes 32-bit int)
    return (x ^ s) - s; // negative: flip all bits and add 1; non-negative: unchanged
}
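A minimal, self-contained check of the trick (my addition, not from the original answer); the shift width is written out so it does not assume a 32-bit int, though it still assumes the usual arithmetic right shift of negative values:

#include <climits>
#include <cstdio>

// Branchless two's complement absolute value (assumes arithmetic right shift).
static int my_abs(int x)
{
    int s = x >> (sizeof(int) * CHAR_BIT - 1); // -1 if x is negative, 0 otherwise
    return (x ^ s) - s;                        // negative: flip bits and add 1
}

int main()
{
    std::printf("%d %d %d\n", my_abs(-42), my_abs(0), my_abs(42)); // prints: 42 0 42
    return 0;
}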
Answer 3:
What are your compiler and settings? I'm sure MS and GCC implement "intrinsic functions" for many math and string operations.
The following line:
printf("%.3f", abs(1.25));
falls into the following "fabs" code path (in msvcr90d.dll):
004113DE sub esp,8
004113E1 fld qword ptr [__real@3ff4000000000000 (415748h)]
004113E7 fstp qword ptr [esp]
004113EA call abs (4110FFh)
abs calls the C runtime 'fabs' implementation in MSVCR90D (it's rather large):
102F5730 mov edi,edi
102F5732 push ebp
102F5733 mov ebp,esp
102F5735 sub esp,14h
102F5738 fldz
102F573A fstp qword ptr [result]
102F573D push 0FFFFh
102F5742 push 133Fh
102F5747 call _ctrlfp (102F6140h)
102F574C add esp,8
102F574F mov dword ptr [savedcw],eax
102F5752 movzx eax,word ptr [ebp+0Eh]
102F5756 and eax,7FF0h
102F575B cmp eax,7FF0h
102F5760 jne fabs+0D2h (102F5802h)
102F5766 sub esp,8
102F5769 fld qword ptr [x]
102F576C fstp qword ptr [esp]
102F576F call _sptype (102F9710h)
102F5774 add esp,8
102F5777 mov dword ptr [ebp-14h],eax
102F577A cmp dword ptr [ebp-14h],1
102F577E je fabs+5Eh (102F578Eh)
102F5780 cmp dword ptr [ebp-14h],2
102F5784 je fabs+77h (102F57A7h)
102F5786 cmp dword ptr [ebp-14h],3
102F578A je fabs+8Fh (102F57BFh)
102F578C jmp fabs+0A8h (102F57D8h)
102F578E push 0FFFFh
102F5793 mov ecx,dword ptr [savedcw]
102F5796 push ecx
102F5797 call _ctrlfp (102F6140h)
102F579C add esp,8
102F579F fld qword ptr [x]
102F57A2 jmp fabs+0F8h (102F5828h)
102F57A7 push 0FFFFh
102F57AC mov edx,dword ptr [savedcw]
102F57AF push edx
102F57B0 call _ctrlfp (102F6140h)
102F57B5 add esp,8
102F57B8 fld qword ptr [x]
102F57BB fchs
102F57BD jmp fabs+0F8h (102F5828h)
102F57BF mov eax,dword ptr [savedcw]
102F57C2 push eax
102F57C3 sub esp,8
102F57C6 fld qword ptr [x]
102F57C9 fstp qword ptr [esp]
102F57CC push 15h
102F57CE call _handle_qnan1 (102F98C0h)
102F57D3 add esp,10h
102F57D6 jmp fabs+0F8h (102F5828h)
102F57D8 mov ecx,dword ptr [savedcw]
102F57DB push ecx
102F57DC fld qword ptr [x]
102F57DF fadd qword ptr [__real@3ff0000000000000 (1022CF68h)]
102F57E5 sub esp,8
102F57E8 fstp qword ptr [esp]
102F57EB sub esp,8
102F57EE fld qword ptr [x]
102F57F1 fstp qword ptr [esp]
102F57F4 push 15h
102F57F6 push 8
102F57F8 call _except1 (102F99B0h)
102F57FD add esp,1Ch
102F5800 jmp fabs+0F8h (102F5828h)
102F5802 mov edx,dword ptr [ebp+0Ch]
102F5805 and edx,7FFFFFFFh
102F580B mov dword ptr [ebp-0Ch],edx
102F580E mov eax,dword ptr [x]
102F5811 mov dword ptr [result],eax
102F5814 push 0FFFFh
102F5819 mov ecx,dword ptr [savedcw]
102F581C push ecx
102F581D call _ctrlfp (102F6140h)
102F5822 add esp,8
102F5825 fld qword ptr [result]
102F5828 mov esp,ebp
102F582A pop ebp
102F582B ret
In release mode, the FPU FABS instruction is used instead (it takes only 1 clock cycle on any FPU from the Pentium onward); the disassembly output is:
00401006 fld qword ptr [__real@3ff4000000000000 (402100h)]
0040100C sub esp,8
0040100F fabs
00401011 fstp qword ptr [esp]
00401014 push offset string "%.3f" (4020F4h)
00401019 call dword ptr [__imp__printf (4020A0h)]
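If you want to reproduce this kind of observation with another toolchain, one low-effort approach (my sketch; the file name and flags are only examples) is to compile a one-line wrapper with optimizations and read the emitted assembly:

// fabs_test.cpp -- e.g. "g++ -O2 -S fabs_test.cpp", then inspect fabs_test.s
#include <cmath>

double call_fabs(double x)
{
    return std::fabs(x); // with optimizations this is typically a single instruction
}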
Answer 4:
It probably just uses a bitmask to set the sign bit to 0.
Answer 5:
There can be several things:
- Are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitly, or have a using std::abs; declaration). See the sketch after this list.
- The compiler might have an intrinsic implementation of some float functions (e.g. compiling them directly into FPU instructions).
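To illustrate the first point, here is a small sketch (mine, not the answerer's) of how an unqualified call can bind to a different overload than intended; exactly which overloads the headers also put into the global namespace is implementation-dependent, so the first line's output may vary:

#include <cstdlib>   // int abs(int); most implementations also expose it as ::abs
#include <cmath>     // std::abs(double), std::abs(float), ...
#include <iostream>

int main()
{
    // Unqualified: may resolve to the integer overload (prints 1) or to a
    // floating-point overload (prints 1.25), depending on the implementation.
    std::cout << abs(-1.25) << '\n';

    // Qualified (or after "using std::abs;"): reliably the double overload.
    std::cout << std::abs(-1.25) << '\n'; // prints 1.25
    return 0;
}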
However, I'm surprised the compiler doesn't eliminate the loop altogether: you don't do anything with an observable effect inside the loop, and at least in the case of abs, the compiler should know there are no side effects.
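A common way to keep such a micro-benchmark loop from being optimized away is to accumulate the result into a volatile variable; a minimal sketch, assuming the original 10,000,000-iteration loop:

#include <cmath>
#include <ctime>
#include <iostream>

int main()
{
    volatile float sink = 0.0f;                 // the store is an observable effect
    std::clock_t begin = std::clock();
    for (int i = 0; i < 10000000; ++i)
        sink = sink + std::abs(-1.25f);
    std::clock_t end = std::clock();
    std::cout << (end - begin) * 1000.0 / CLOCKS_PER_SEC << " ms\n";
    return 0;
}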
Answer 6:
Probably the library version of abs is an intrinsic function, whose behavior is exactly known by the compiler; it can even compute the value at compile time (since in your case it is known) and optimize the call away. You should try your benchmark with a value known only at runtime (provided by the user, or obtained with rand() before the two loops).
If there's still a difference, it may be because the library abs is written directly in hand-crafted assembly with magic tricks, so it could be a little faster than the generated one.
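A sketch of the suggested change (my code, following the answer's idea): obtain a value that is only known at runtime before the loops, so neither call can be folded at compile time:

#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    // A value the compiler cannot see until runtime.
    float v = -(static_cast<float>(std::rand()) / RAND_MAX + 1.0f);

    volatile float sink = 0.0f;
    std::clock_t begin = std::clock();
    for (int i = 0; i < 10000000; ++i)
        sink = std::abs(v);              // repeat with myMath::abs(v) for the comparison
    std::clock_t end = std::clock();
    std::cout << (end - begin) * 1000.0 / CLOCKS_PER_SEC << " ms\n";
    return 0;
}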
Answer 7:
Your version of abs is inlined and can be computed once, and the compiler can trivially see that the returned value isn't going to change, so it doesn't even need to call the function.
You really need to look at the generated assembly code (set a breakpoint and open the "large" debugger view; the disassembly will be on the bottom left, if memory serves), and then you can see what's going on.
You can find documentation on your processor online without too much trouble; it'll tell you what all of the instructions are, so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)
Answer 8:
The library abs function operates on integers, while you are obviously testing floats. This means that a call to abs with a float argument involves a conversion from float to int (which may be a no-op, since you are using a constant and the compiler may do it at compile time), then an INTEGER abs operation, and then an int -> float conversion. Your templated function works on floats throughout, and this is probably what makes the difference.
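A hedged illustration of the conversion chain this answer describes (via_int_abs is a made-up name, only there to spell the steps out):

#include <cstdlib>
#include <iostream>

// What the integer abs path amounts to when handed a float (illustrative only).
float via_int_abs(float x)
{
    int truncated = static_cast<int>(x);     // float -> int conversion
    int magnitude = std::abs(truncated);     // integer abs
    return static_cast<float>(magnitude);    // int -> float conversion
}

int main()
{
    std::cout << via_int_abs(-1.25f) << '\n'; // prints 1; the fractional part is lost
    return 0;
}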
Source: https://stackoverflow.com/questions/2878076/what-is-different-about-c-math-h-abs-compared-to-my-abs