问题
What the difference between HashSet.Union
vs HashSet.Unionwith
when i combine 2 hashsets.
I am trying to combine like this:
HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;
enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll;
what is the best method for this example and why?
回答1:
Well, it's not HashSet.Union
but Enumerable.Union, so you are using a LINQ extension method that works with any kind of IEnumerable<>
whereas HashSet.UnionWith is a real HashSet
method that modifes the current instance.
Union
returns anIEnumerable<TSource>
UnionWith
isvoid
, it modifies the currentHashSet
instance- maybe
UnionWith
is slightly more efficient because it can be optimized
If you don't want to support any kind of sequence in your method so HashSet
is fix and you can modify it, use that, otherwise use the LINQ extension.
If you create the HashSet
instance just for this purpose it doesn't really matter and i would prefer LINQ to be more flexible and to be able to chain my query.
回答2:
Given HashSet<T> A and HashSet<T>
B there are four ways to A ∪ B:
new HashSet<t>(A.Union(B))
(see HashSet<T&>(IEnumerable<T>) and Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>))- A.UnionWith(B)
HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
new HashSet<t>(A.Concat(B))
(see Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>))
Each has its advantages and disadvantages:
- 1 and 4 are expressions that results in a
HashSet
whereas 2 and 3 are statements or statement blocks.
Expression 1 and 4 can be used in more places than 2 and 3. For example, using 2 or 3 in a linq query syntax expression is cumbersome:from x in setofSetsA as IEnumerable<HashSet<T>> from y in setOfSetsB as IEnumerable<HashSet<T>> select x.UnionWith(y)
will not work sinceUnionWith
returns void. - 1, 3, and 4 preserve A and B as they are and return a fresh set whereas 2 modifies A.
There are situations where modifying one of the original sets is bad and there are situations where at least one of the original sets can be modified without negative consequences. Computational cost:
A.UnionWith(B)
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|B|))≤
HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|A| + |B|))≤
HashSet<T>(A.Concat(B))
(≈ O(log(|A∪B|) * |A∪B|) + O(|A| + |B|))≤
HashSet<T>(A.Union(B))
(≈ 2 * O(log(|A∪B|) * |A∪B|) + O(|A| + |B| + |A∪B|))The next section delves into the reference source to see the basis of these performance estimates.
Performance
HashSet<T>
In union option 1, 3, and 4 the constructor HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) is used to create a HashSet<T>
from an IEnumerable<T>
.
If the passed IEnumerable<T>
has a Count property —i.e. if it is an ICollection<T>—, this property is used to set the size of the new HashSet
:
int suggestedCapacity = 0; ICollection<T> coll = collection as ICollection<T>; if (coll != null) { suggestedCapacity = coll.Count; } Initialize(suggestedCapacity);
-- HashSet.cs line 136–141
The [Count()][10]
method is never called. So if the IEnumerable
's count can be retrieved without effort it is used to reserve capacity; otherwise the HashSet
grows and reallocates when new elements are added.
In option 1 A.Union(B)
and option 4 A.Concat(B)
are not ICollection<T>
therefore the created HashSet
will grow and reallocate a few (≈ log(|A∪B|)) times. Option 3 can use the Count
of A.
The constructor calls UnionWith to fill the new empty HashSet
:
this.UnionWith(collection);
-- HashSet.cs line 143
UnionWith(IEnumerable<T>)
iterates over the elements in the IEnumerable<T>
passed as argument and calls AddIfNotPresent(T) for each.
AddIfNotPresent(T)
inserts elements and ensures that duplicates are never inserted into the set.HashSet<T>
is implemented as an array of slots, m_slots, and an array of buckets, m_buckets. A bucket contains just an int
index into the m_slots
array. Per bucket the Slots in m_slots
form a linked list with an index to the next Slot
in m_slots
.
AddIfNotPresent(T)
jumps to the correct bucket and then traverses its linked list to check if the element is already present:
for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) { if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) { return false; } }
-- HashSet.cs line 968–975
Next a free index is found and a slot is reserved. First the list of free slots, m_freelist, is checked. When there are no slots in the freelist, the next empty slot in the m_slots
array is used. More capacity is reserved (via IncreaseCapacity()) if there are no slots in the freelist and no empty slots:
int index; if (m_freeList >= 0) { index = m_freeList; m_freeList = m_slots[index].next; } else { if (m_lastIndex == m_slots.Length) { IncreaseCapacity(); // this will change during resize bucket = hashCode % m_buckets.Length; } index = m_lastIndex; m_lastIndex++; }
-- HashSet.cs line 977–990
AddIfNotPresent(T)
has three operations that demand some computation: the call to object.GetHashCode(), calling object.Equals(object) when there is a collision, and IncreaseCapacity()
. The actual adding of an element only incurs the cost of setting some pointers and some ints.
When the HashSet<T>
needs to IncreaseCapacity()
the capacity is at least doubled. Therefore we can conclude that on average a HashSet<T>
is filled for 75%. If the hashes are evenly distributed, the expectancy of a hash collision is also 75%.
SetCapacity(int, bool), called by IncreaseCapacity()
, is the most expensive: it allocates new arrays, it copies the old slot array to the new array, and it recalculates the bucket lists:
Slot[] newSlots = new Slot[newSize]; if (m_slots != null) { Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex); } ... int[] newBuckets = new int[newSize]; for (int i = 0; i < m_lastIndex; i++) { int bucket = newSlots[i].hashCode % newSize; newSlots[i].next = newBuckets[bucket] - 1; newBuckets[bucket] = i + 1; } m_slots = newSlots; m_buckets = newBuckets;
-- HashSet.cs line 929–949
Option 1 and 4 (new HashSet<T>(A.Union(B))
) will result in slightly more calls to IncreaseCapacity()
. The cost —without the cost of A.Union(B)
or A.Concat(B)
— is approximately O(log(|A∪B|) * |A∪B|).
Whereas when using option 2 (A.UnionWith(B)
) or option 3 (HashSet<T> C = new HashSet<T>(A); C.UnionWith(B)
) we get a 'discount' of log(|A|) on the cost: O((log(|A∪B|) - log(|A|)) * |A∪B|). It pays (slightly) to use the largest set as the target into witch the other is merged.
Enumerable<T>.Union(IEnumerable<T>)
Enumerable<T>.Union(IEnumerable<T>)
is implemented via UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>).
The UnionIterator
uses Set<T> —an internal class in Enumerable.cs
— that is very similar to HashSet<T>
. UnionIterator
lazily Add(T)s items from A and B to this Set<T>
and yields
the elements if they could be added. The work is done in Find(T, bool) which is similar to HashSet<T>.AddIfNotPresent(T)
. Check if the element is already present:
int hashCode = InternalGetHashCode(value); for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) { if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true; }
-- Enumerable.cs line 2423–2426
Find a free index and reserve a slot:
int index; if (freeList >= 0) { index = freeList; freeList = slots[index].next; } else { if (count == slots.Length) Resize(); index = count; count++; } int bucket = hashCode % buckets.Length; slots[index].hashCode = hashCode; slots[index].value = value; slots[index].next = buckets[bucket] - 1; buckets[bucket] = index + 1;
-- Enumerable.cs line 2428–2442
Resize() is similar to IncreaseCapacity()
. The biggest difference between the two is that Resize()
does not use a prime number for the number of buckets, so with bad GetHashCode()
there is a slightly higher chance for collisions. Code of Resize()
:
int newSize = checked(count * 2 + 1); int[] newBuckets = new int[newSize]; Slot[] newSlots = new Slot[newSize]; Array.Copy(slots, 0, newSlots, 0, count); for (int i = 0; i < count; i++) { int bucket = newSlots[i].hashCode % newSize; newSlots[i].next = newBuckets[bucket] - 1; newBuckets[bucket] = i + 1; } buckets = newBuckets; slots = newSlots;
-- Enumerable.cs line 2448–2458
The performance cost of A.Union(B)
is not significantly different from that of HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B);
. In option 1 (new HashSet<T>(A.Union(B))
) the same HashSet
is created twice resulting in a very costly 2 * O(log(|A∪B|) * (|A∪B|)). Option 4 results from knowing how HashSet<T>(IEnumerable<T>)
and Enumerable.Union(IEnumerable<T>, IEnumerable<T>)
are implemented. It avoids the redundant A.Union(B)
leading to a cost of O(log(|A∪B|) * |A∪B|).
来源:https://stackoverflow.com/questions/45917831/union-vs-unionwith-in-hashset