问题
I stumbled upon the following problem.
I want a hashset with all numbers from 1 to 100.000.000.
I tried the following code:
var mySet = new HashSet<int>();
for (var k = 1; k <= 100000000; k++)
mySet.Add(k);
That code didn't make it since I got memory overflow somewhere around the 49mil. This was also pretty slow and memory grew excessively.
Then I tried this.
var mySet = Enumerable.Range(1, 100000000).ToHashSet();
where ToHashSet() is the following code:
public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
{
return new HashSet<T>(source);
}
I got a memory overflow again but I was able to put in more numbers then with the previous code.
The thing that does work is the following:
var tempList = new List<int>();
for (var k = 1; k <= 100000000; k++)
tempList.Add(k);
var numbers = tempList.ToHashSet();
It takes about 800ms on my system to just fill the tempList where the Enumerable.Range() only takes 4 ticks!
I do need that HashSet or else it would take to much time to lookup values (I need it to be O(1)) and it would be great if I could do that the fastest way.
Now my question is:
Why do the first two methods cause a memory overflow where the third doesn't?
Is there something special HashSet does with the memory on initializing?
My system has 16GB memory so i was quite surprised when I got the overflow exceptions.
回答1:
Like other collection types, the HashSet will automatically increase its capacity as required as you add elements. When adding a large number of elements, this will result in a large number of reallocations.
If you initialize it with a constructor that takes an IEnumerable<T>
, it will check if the IEnumerable<T>
is in fact an ICollection<T>
, and if so, initialize the HashSet's capacity to the size of the collection.
This is what's happening in you're third example - you're adding a List<T>
which is also an ICollection<T>
, so your HashSet is given an initial capacity equal to the size of the list, thus ensuring that no reallocations are needed.
You will be even more efficient if you use the List<T>
constructor that takes a capacity parameter, as this will avoid reallocations when building the list:
var noElements = 100000000;
var tempList = new List<int>(noElements);
for (var k = 1; k <= noElements; k++)
tempList.Add(k);
var numbers = tempList.ToHashSet();
As for your system memory; check if this is a 32-bit or 64-bit process. A 32-bit process has a maximum of 2GB memory available (3GB if you've used the /3GB startup switch).
Unlike other collection types (e.g. List<T>
, Dictionary<TKey,TValue>
), HashSet<T>
doesn't have a constructor that takes a capacity
parameter to set the initial capacity. If you want to initialize a HashSet<T>
with a large number of elements, the most efficient way to do so is probably to first add the elements to an array or List<T>
with the appropriate capacity, then pass this array or list to the HashSet<T>
constructor.
回答2:
I guess HashSet<T>
, like most .net collections, uses the array doubling strategy for growth. Unfortunately there are no constructor overloads that take a capacity.
But if it checks for ICollection<T>
and used ICollection<T>.Count
as initial capacity you can implement a rudimentary implementation of ICollection<T>
that implements GetEnumerator()
and Count
. That way you can directly fill the HashSet<T>
without materializing a temporary List<T>
.
回答3:
If you put 100 million ints into a hashset that will consume 1.5GB (my machine) If you create a bool[100000000] where you set each number you've had to true it takes only 100MB and also looks up faster than a hashset. This assumes the ints range from 0-100000000
回答4:
HashSet
grows by doubling and that allocation causes it to exceed available memory.
On a 64-bit system a HashSet can hold upwards of 89 million items.
On a 32-bit system the limit is about 61.7 million items.
that's why you are getting memory overflow exception
for more info
http://blog.mischel.com/2008/04/09/hashset-limitations/
来源:https://stackoverflow.com/questions/11557056/what-does-a-hashset-do-with-memory-when-initializing-a-collection