Why HashSet class is not used to implement Enumerable.Distinct

前端 未结 1 959
日久生厌
日久生厌 2021-01-03 15:57

I needed to access the asymptotic time and space complexity of the IEnumerable.Distinct in big O notation

So I was looking at the implementation of ex

相关标签:
1条回答
  • 2021-01-03 16:32

    which is almost a classical implementation of a hash table with "open addressing"

    Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.

    a lot of code in Set is just a copy-paste from HashSet, with some omissions

    AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.

    It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells

    In terms of performance, HashSet's using prime numbers means its more likely to avoid collisions with poor hashes (but just how much an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out freelist and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.

    As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.

    Especially given the fact that both of these classes are in the same assembly System.Core

    Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.

    Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.

    That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think @james-ko was considering testing the benefits of doing that.

    It looks like HashSet<T> will perform better

    For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).

    so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.

    Use Distinct(). If you've a bottle neck then it might be that HashSet<T> will win with a given data-set, but if you do try that make sure your profiling closely matches real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first).

    0 讨论(0)
提交回复
热议问题