> I needed to assess the asymptotic time and space complexity of `IEnumerable.Distinct` in big O notation
> So I was looking at the implementation of the internal `Set<T>` class it uses, which is almost a classical implementation of a hash table with "open addressing"
Look again: it's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the `next` field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
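To make that concrete, here's a minimal sketch of the scheme (a hypothetical `SimpleSet<T>` of my own, not the actual framework source): all slots live in one flat array, each bucket stores the index of its chain head, and collisions are chased through the `Next` field rather than through separately allocated nodes.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of separate chaining with list head cells.
// My own simplification; not the actual framework source.
internal sealed class SimpleSet<T>
{
    private struct Slot
    {
        internal int HashCode; // cached hash of Value
        internal int Next;     // index of the next slot in this bucket's chain, or -1
        internal T Value;
    }

    // _buckets[b] holds (slot index + 1) of the chain head for bucket b; 0 means empty.
    private int[] _buckets = new int[7];
    private Slot[] _slots = new Slot[7];
    private int _count;
    private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default;

    // Returns true if the value was added, false if it was already present.
    public bool Add(T value)
    {
        int hashCode = value == null ? 0 : _comparer.GetHashCode(value) & 0x7FFFFFFF;
        int bucket = hashCode % _buckets.Length;

        // Collisions are resolved by walking Next indices within the one slot
        // array, rather than dereferencing separately allocated list nodes.
        for (int i = _buckets[bucket] - 1; i >= 0; i = _slots[i].Next)
        {
            if (_slots[i].HashCode == hashCode && _comparer.Equals(_slots[i].Value, value))
                return false;
        }

        if (_count == _slots.Length)
        {
            Resize();
            bucket = hashCode % _buckets.Length; // bucket count has changed
        }
        _slots[_count] = new Slot { HashCode = hashCode, Next = _buckets[bucket] - 1, Value = value };
        _buckets[bucket] = ++_count;
        return true;
    }

    private void Resize()
    {
        int newSize = checked(_count * 2 + 1);
        var newSlots = new Slot[newSize];
        Array.Copy(_slots, newSlots, _count);
        var newBuckets = new int[newSize];
        for (int i = 0; i < _count; i++)
        {
            int b = newSlots[i].HashCode % newSize;
            newSlots[i].Next = newBuckets[b] - 1;
            newBuckets[b] = i + 1;
        }
        _slots = newSlots;
        _buckets = newBuckets;
    }
}
```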
> a lot of code in `Set` is just a copy-paste from `HashSet`, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that `Enumerable` and `HashSet` were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5, so it's feasible.
It's quite possible that `HashSet<T>` started by copying `Set<T>` and reworking it to better serve as a publicly exposed type, though it's also possible that the two were independently based on the same principle of separate chaining with list head cells.
In terms of performance, `HashSet`'s use of prime-number table sizes means it's more likely to avoid collisions with poor hashes (though just how much of an advantage that is, is not a simple question), but `Set` is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of `Set` takes advantage of the fact that once an item is removed (which happens, for example, during `Intersect`) no item will ever be added again, which allows it to leave out the freelist and any work related to it, which `HashSet` couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.

As such, with different sets of data with different distributions of hash codes, sometimes one performs better, sometimes the other.
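As a toy illustration of the prime-number point (mine, not code taken from either class): when incoming hash codes share a common factor with the table size, a composite-sized table funnels them into a fraction of the buckets, while a prime-sized table spreads them across all of them.

```csharp
using System;
using System.Linq;

class PrimeBucketDemo
{
    static void Main()
    {
        // A "poor" hash: every code is a multiple of 8.
        int[] hashes = Enumerable.Range(0, 1000).Select(i => i * 8).ToArray();

        // Composite table size sharing a factor with the hashes: only 8 of 64 buckets used.
        Console.WriteLine(hashes.Select(h => h % 64).Distinct().Count()); // 8

        // Prime table size: all 61 buckets used.
        Console.WriteLine(hashes.Select(h => h % 61).Distinct().Count()); // 61
    }
}
```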
> Especially given the fact that both of these classes are in the same assembly `System.Core`
Only in some versions of .NET; in others they're in separate assemblies. In .NET Core we had two versions of `Set<T>`: one in the assembly that has `System.Linq` and one in the separate assembly that has `System.Linq.Expressions`. The former got trimmed down as described above; the latter was replaced with a use of `HashSet<T>` because it was doing less there.

Of course System.Core came first, but the fact that those elements could be separated out at all speaks to System.Core not being a single monolithic blob of inter-dependencies.
That there is now a `ToHashSet()` method in .NET Core's version of Linq makes the possibility of replacing `Set<T>` with `HashSet<T>` more justifiable, though not a no-brainer. I think @james-ko was considering testing the benefits of doing that.
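For what that looks like from the caller's side (a small usage sketch; `ToHashSet()` shipped with .NET Core 2.0 and later came to .NET Framework 4.7.2):

```csharp
using System.Collections.Generic;
using System.Linq;

IEnumerable<int> source = new[] { 1, 2, 2, 3, 3, 3 };

// Lazy deduplication via Linq's internal set:
IEnumerable<int> distinct = source.Distinct();

// Eager materialisation straight into a HashSet<T>:
HashSet<int> set = source.ToHashSet();
```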
> It looks like `HashSet<T>` will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on the source data. That's before getting into considerations of optimisations that go across a few different Linq methods (not many in the initial versions of Linq, but a good few in .NET Core).
> so should I avoid using the `Distinct` extension method, and write my own extension method that would use `HashSet<T>` instead of `Set<T>`?
Use `Distinct()`. If you've a bottleneck then it might be that `HashSet<T>` will win with a given data-set, but if you do try that, make sure your profiling closely matches the real values your code will encounter in real life. There's no point deciding one approach is faster based on some arbitrary tests if your application hits cases where the other does better. (And if I were finding this a problem spot, I'd first take a look at whether the `GetHashCode()` of the types in question could be improved for either speed or distribution of bits.)
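To show the kind of improvement I mean, here's a sketch for a hypothetical `Point` type: combining the fields with `System.HashCode` (available from .NET Core 2.1) is both fast and mixes the bits well, where a naive XOR collapses symmetric values into the same code.

```csharp
using System;

public readonly struct Point : IEquatable<Point>
{
    public Point(int x, int y) { X = x; Y = y; }
    public int X { get; }
    public int Y { get; }

    public bool Equals(Point other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Point p && Equals(p);

    // A naive "X ^ Y" gives (1, 2) and (2, 1) the same hash code;
    // HashCode.Combine mixes the bits so such patterns spread out.
    public override int GetHashCode() => HashCode.Combine(X, Y);
}
```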