I found this in the .NET source code: it claims to be 100 times faster than System.Double.IsNaN
. Is there a reason not to use this function instead of System.Double.IsNaN?
I call shenanigans. The "fast" version executes a considerably larger number of instructions and even performs more reads from memory (from the stack, so likely in L1 cache, but still slower than registers).
00007FFAC53D3D01 movups xmmword ptr [rsp+8],xmm0
00007FFAC53D3D06 sub rsp,48h
00007FFAC53D3D0A mov qword ptr [rsp+20h],0
00007FFAC53D3D13 mov qword ptr [rsp+28h],0
00007FFAC53D3D1C mov qword ptr [rsp+30h],0
00007FFAC53D3D25 mov rax,7FFAC5423D40h
00007FFAC53D3D2F mov eax,dword ptr [rax]
00007FFAC53D3D31 test eax,eax
00007FFAC53D3D33 je 00007FFAC53D3D3A
00007FFAC53D3D35 call 00007FFB24EE39F0
00007FFAC53D3D3A mov r8d,8
00007FFAC53D3D40 xor edx,edx
00007FFAC53D3D42 lea rcx,[rsp+20h]
00007FFAC53D3D47 call 00007FFB24A21680
t.DoubleValue = value;
00007FFAC53D3D4C movsd xmm5,mmword ptr [rsp+50h]
00007FFAC53D3D52 movsd mmword ptr [rsp+20h],xmm5
UInt64 exp = t.UintValue & 0xfff0000000000000;
00007FFAC53D3D58 mov rax,qword ptr [rsp+20h]
00007FFAC53D3D5D mov rcx,0FFF0000000000000h
00007FFAC53D3D67 and rax,rcx
00007FFAC53D3D6A mov qword ptr [rsp+28h],rax
UInt64 man = t.UintValue & 0x000fffffffffffff;
00007FFAC53D3D6F mov rax,qword ptr [rsp+20h]
00007FFAC53D3D74 mov rcx,0FFFFFFFFFFFFFh
00007FFAC53D3D7E and rax,rcx
00007FFAC53D3D81 mov qword ptr [rsp+30h],rax
return (exp == 0x7ff0000000000000 || exp == 0xfff0000000000000) && (man != 0);
00007FFAC53D3D86 mov rax,7FF0000000000000h
00007FFAC53D3D90 cmp qword ptr [rsp+28h],rax
00007FFAC53D3D95 je 00007FFAC53D3DA8
00007FFAC53D3D97 mov rax,0FFF0000000000000h
00007FFAC53D3DA1 cmp qword ptr [rsp+28h],rax
00007FFAC53D3DA6 jne 00007FFAC53D3DBD
00007FFAC53D3DA8 xor eax,eax
00007FFAC53D3DAA cmp qword ptr [rsp+30h],0
00007FFAC53D3DB0 setne al
00007FFAC53D3DB3 mov dword ptr [rsp+38h],eax
00007FFAC53D3DB7 mov al,byte ptr [rsp+38h]
00007FFAC53D3DBB jmp 00007FFAC53D3DC1
00007FFAC53D3DBD xor eax,eax
00007FFAC53D3DBF jmp 00007FFAC53D3DC1
00007FFAC53D3DC1 nop
00007FFAC53D3DC2 add rsp,48h
00007FFAC53D3DC6 ret
Versus the .NET version:
return (*(UInt64*)(&d) & 0x7FFFFFFFFFFFFFFFL) > 0x7FF0000000000000L;
00007FFAC53D3DE0 movsd mmword ptr [rsp+8],xmm0
00007FFAC53D3DE6 sub rsp,38h
00007FFAC53D3DEA mov rax,7FFAC5423D40h
00007FFAC53D3DF4 mov eax,dword ptr [rax]
00007FFAC53D3DF6 test eax,eax
00007FFAC53D3DF8 je 00007FFAC53D3DFF
00007FFAC53D3DFA call 00007FFB24EE39F0
00007FFAC53D3DFF mov rdx,qword ptr [rsp+40h]
00007FFAC53D3E04 mov rax,7FFFFFFFFFFFFFFFh
00007FFAC53D3E0E and rdx,rax
00007FFAC53D3E11 xor ecx,ecx
00007FFAC53D3E13 mov rax,7FF0000000000000h
00007FFAC53D3E1D cmp rdx,rax
00007FFAC53D3E20 seta cl
00007FFAC53D3E23 mov dword ptr [rsp+20h],ecx
00007FFAC53D3E27 movzx eax,byte ptr [rsp+20h]
00007FFAC53D3E2C jmp 00007FFAC53D3E2E
00007FFAC53D3E2E nop
00007FFAC53D3E2F add rsp,38h
00007FFAC53D3E33 ret
It claims to be 100 times faster than System.Double.IsNaN
Yes, that used to be true. You are missing the time machine needed to know when this decision was made. Double.IsNaN() didn't always look like that. From the SSCLI10 source code:
public static bool IsNaN(double d)
{
    // Comparisons of a NaN with another number are always false, hence both conditions will be false.
    if (d < 0d || d >= 0d) {
        return false;
    }
    return true;
}
Which performs very poorly on the FPU in 32-bit code if d is NaN. That is just an aspect of the chip design; NaN is treated as exceptional in the microcode. The Intel processor manuals say very little about it, other than documenting a processor perf counter that tracks the number of "floating point assists" and noting that the microcode sequencer comes into play for denormals and NaNs, "potentially costing hundreds of cycles". This is not an issue in 64-bit code, which uses SSE2 instructions that don't have this perf hit.
Some code to play with to see this yourself:
using System;
using System.Diagnostics;

class Program {
    static void Main(string[] args) {
        double d = double.NaN;
        for (int test = 0; test < 10; ++test) {
            var sw1 = Stopwatch.StartNew();
            bool result1 = false;
            for (int ix = 0; ix < 1000 * 1000; ++ix) {
                result1 |= double.IsNaN(d);
            }
            sw1.Stop();
            var sw2 = Stopwatch.StartNew();
            bool result2 = false;
            for (int ix = 0; ix < 1000 * 1000; ++ix) {
                result2 |= IsNaN(d);
            }
            sw2.Stop();
            Console.WriteLine("{0} - {1} x {2}% ({3}, {4})", sw1.Elapsed, sw2.Elapsed,
                100 * sw2.ElapsedTicks / sw1.ElapsedTicks, result1, result2);
        }
        Console.ReadLine();
    }

    public static bool IsNaN(double d) {
        // Comparisons of a NaN with another number are always false, hence both conditions will be false.
        if (d < 0d || d >= 0d) {
            return false;
        }
        return true;
    }
}
Which uses the version of Double.IsNaN() that got micro-optimized. Such micro-optimizations are not evil in a framework, btw; the great burden on the Microsoft .NET programmers is that they can rarely guess when their code is in the critical path of an application.
Results on my machine when targeting 32-bit code (Haswell mobile core):
00:00:00.0027095 - 00:00:00.2427242 x 8957%
00:00:00.0025248 - 00:00:00.2191291 x 8678%
00:00:00.0024344 - 00:00:00.2209950 x 9077%
00:00:00.0024144 - 00:00:00.2321169 x 9613%
00:00:00.0024126 - 00:00:00.2173313 x 9008%
00:00:00.0025488 - 00:00:00.2237517 x 8778%
00:00:00.0026940 - 00:00:00.2231146 x 8281%
00:00:00.0025052 - 00:00:00.2145660 x 8564%
00:00:00.0025533 - 00:00:00.2200943 x 8619%
00:00:00.0024406 - 00:00:00.2135839 x 8751%
Here's a naive benchmark:
public static void Main()
{
    int iterations = 500 * 1000 * 1000;
    double nan = double.NaN;
    double notNan = 42;

    Stopwatch sw = Stopwatch.StartNew();
    bool isNan;
    for (int i = 0; i < iterations; i++)
    {
        isNan = IsNaN(nan);    // true, IsNaN is the wrapper from the question
        isNan = IsNaN(notNan); // false
    }
    sw.Stop();
    Console.WriteLine("IsNaN: {0}", sw.ElapsedMilliseconds);

    sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        isNan = double.IsNaN(nan);    // true
        isNan = double.IsNaN(notNan); // false
    }
    sw.Stop();
    Console.WriteLine("double.IsNaN: {0}", sw.ElapsedMilliseconds);
    Console.Read();
}
Obviously the claim is wrong:
IsNaN: 15012
double.IsNaN: 6243
EDIT + NOTE: I'm sure the timings will change depending on input values, many other factors, etc., but claiming that, generally speaking, this wrapper is 100x faster than the default implementation seems just wrong.