I am writing algorithms that work on series of numeric data, where a value in the series sometimes needs to be null. However, because this application is performance critical, I have ruled out Nullable<T> - so what value should I use to represent null instead?
Well, if you've ruled out Nullable<T>, you are left with domain values - i.e. a magic number that you treat as null. While this isn't ideal, it isn't uncommon either - for example, a lot of the main framework code treats DateTime.MinValue the same as null. This at least moves the damage far away from common values...
Edit: to highlight, the following applies only where there is no NaN.
So where there is no NaN, maybe use .MinValue - but just remember what evils happen if you accidentally use that same value when you actually mean that number... Obviously for unsigned data you'll need .MaxValue (avoid zero!!!).
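For example, a minimal sketch of the sentinel approach (the constant and method names are illustrative, not from the answer):

    // Hypothetical sketch: double.MinValue acts as the "null" marker
    const double NoValue = double.MinValue;

    static double Sum(double[] series)
    {
        double sum = 0;
        foreach (double v in series)
        {
            if (v == NoValue) continue;   // skip entries treated as null
            sum += v;
        }
        return sum;
    }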
Personally, I'd try to use Nullable<T> as expressing my intent more safely... there may be ways to optimise your Nullable<T> code, perhaps. And also - by the time you've checked for the magic number in all the places you need to, perhaps it won't be much faster than Nullable<T>?
Partial answer:
Float and Double provide NaN (Not a Number). NaN is a little tricky since, per spec, NaN != NaN. If you want to know if a number is NaN, you'll need to use Double.IsNaN().
See also Binary floating point and .NET.
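A quick sketch of the quirk described above:

    double missing = double.NaN;

    Console.WriteLine(missing == double.NaN);   // False - NaN != NaN per the spec
    Console.WriteLine(double.IsNaN(missing));   // True  - the correct test
    Console.WriteLine(missing + 1.0);           // NaN   - NaN propagates through arithmetic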
Maybe the significant performance decrease happens when calling one of Nullable's members or properties (boxing).
Try using a struct with the double plus a boolean telling whether the value is specified or not.
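A minimal sketch of such a struct (the names are only illustrative; see also the MaybeValid<T> answer further down):

    // Hypothetical value-plus-flag struct as an alternative to Nullable<double>
    struct OptionalDouble
    {
        public bool HasValue;
        public double Value;
    }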
I somewhat disagree with Gravell on this specific edge case: a nulled variable is considered 'not defined', it doesn't have a value. So whatever is used to signal that is OK: even magic numbers, but with magic numbers you have to take into account that a magic number will always haunt you in the future when it suddenly becomes a 'valid' value. With Double.NaN you don't have to be afraid of that: it's never going to become a valid double. Though you have to consider that, within the sequence of doubles, NaN can then only be used as a marker for 'not defined'; you obviously can't also use it as an error code in the sequences.
So whatever is used to mark 'undefined': it has to be clear in the context of the set of values that that specific value is considered the value for 'undefined' AND that this won't change in the future.
If Nullable gives you too much trouble, use NaN, or whatever else, as long as you consider the consequences: the value chosen represents 'undefined' and that will stay.
I am working on a large project that uses NaN as a null value. I am not entirely comfortable with it - for similar reasons as yours: not knowing what can go wrong. We haven't encountered any real problems so far, but be aware of the following:
NaN arithmetic - While, most of the time, "NaN promotion" is a good thing, it might not always be what you expect.
Comparison - Comparison of values gets rather expensive if you want NaNs to compare equal. Now, testing floats for equality isn't simple anyway, but ordering (a < b) can get really ugly, because NaNs sometimes need to be smaller, sometimes larger than normal values (see the sketch after this list).
Code Infection - I see lots of arithmetic code that requires specific handling of NaNs to be correct. So you end up with "functions that accept NaNs" and "functions that don't" for performance reasons.
Other non-finites - NaN is not the only non-finite value; positive and negative infinity should be kept in mind as well...
Floating Point Exceptions are not a problem when disabled. Until someone enables them. True story: static initialization of a NaN in an ActiveX control. Doesn't sound scary, until you change the installer to use InnoSetup, which uses a Pascal/Delphi(?) core, which has FPU exceptions enabled by default. Took me a while to figure out.
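To illustrate the comparison point from the list: in .NET the == operator, Equals and CompareTo already disagree on NaN, and any ordering that puts NaNs somewhere else needs a hand-written comparer. A rough sketch (the helper name is made up):

    double a = double.NaN, b = double.NaN;

    Console.WriteLine(a == b);            // False - the operator follows IEEE 754
    Console.WriteLine(a.Equals(b));       // True  - Equals treats NaNs as equal
    Console.WriteLine(a.CompareTo(b));    // 0     - CompareTo treats NaNs as equal
    Console.WriteLine(a.CompareTo(1.0));  // -1    - CompareTo sorts NaN below everything

    // If NaN should instead sort *above* all real values:
    static int CompareNaNLast(double x, double y)
    {
        bool xn = double.IsNaN(x), yn = double.IsNaN(y);
        if (xn && yn) return 0;
        if (xn) return 1;
        if (yn) return -1;
        return x.CompareTo(y);
    }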
So, all in all, nothing serious, though I'd prefer not to have to consider NaNs that often.
I'd use Nullable types as often as possible, unless they are (proven to be) a performance / resource constraint. One case could be large vectors / matrices with occasional NaNs, or large sets of named individual values where the default NaN behavior is correct.
Alternatively, you can use an index vector for vectors and matrices, standard "sparse matrix" implementations, or a separate bool/bit vector.
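The separate bool vector idea could look roughly like this (a sketch, not the answerer's code):

    // Parallel arrays: the mask says which entries of 'values' are actually defined
    static double SumDefined(double[] values, bool[] defined)
    {
        double sum = 0;
        for (int i = 0; i < values.Length; i++)
            if (defined[i]) sum += values[i];   // only include present entries
        return sum;
    }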
One can avoid some of the performance degradation associated with Nullable<T> by defining your own structure:
    struct MaybeValid<T>
    {
        public bool isValue;
        public T Value;
    }
If desired, one may define a constructor, or a conversion operator from T to MaybeValid<T>, etc., but overuse of such things may yield sub-optimal performance. Exposed-field structs can be efficient if one avoids unnecessary data copying. Some people may frown upon the notion of exposed fields, but they can be massively more efficient than properties. If a function that will return a T would need a variable of type T to hold its return value, using a MaybeValid<Foo> simply increases the size of the thing to be returned by 4 bytes. By contrast, using a Nullable<Foo> would require that the function first compute the Foo and then pass a copy of it to the constructor for the Nullable<Foo>. Further, returning a Nullable<Foo> will require that any code which wants to use the returned value make at least one extra copy to a storage location (variable or temporary) of type Foo before it can do anything useful with it. By contrast, code can use the Value field of a variable of type MaybeValid<Foo> about as efficiently as any other variable of type Foo.
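For example, a function returning the struct might be used roughly like this (TryGetValue and Get are hypothetical names, not part of the answer):

    // Assumes the MaybeValid<T> struct defined above
    static MaybeValid<double> TryGetValue(double[] series, int index)
    {
        MaybeValid<double> result;
        result.isValue = index >= 0 && index < series.Length;
        result.Value = result.isValue ? series[index] : 0.0;
        return result;
    }

    static double Get(double[] series, int index, double fallback)
    {
        var item = TryGetValue(series, index);
        return item.isValue ? item.Value : fallback;   // exposed fields read directly
    }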