Question
I have a method that takes an upper limit and returns a list of prime numbers up to that limit.
public static List<int> AllPrimesUnder(int upperLimit)
I later decided that I really just needed to do lookups against the list, often just asking the question "is this prime?". Since I was dealing with all primes under values like a million, I realized that HashSet was the structure I should be using. Lookups against the method's result were certainly faster, but the method itself was slower.
I believe the reason it's slower is that a HashSet checks for duplicates before adding, while a List just shoves each item on the end. What surprised me, and what spawned the question and title, is why starting with a List and using it to create a HashSet, like so:
hashSet = new HashSet<int>(Prime.AllPrimesUnder(1000000));
is faster than using a HashSet internal to the method, enabling a call like this:
hashSet = Prime.AllPrimesUnder_Hash(1000000);
If the slowdown is in the duplicate checking, it should have to do the same amount of checking no matter what, right? This is likely where my understanding is failing me.
Here are the times I'm getting for primes under one million.
- 0.1136s Pure Hash
- 0.0975s Pure List (expected to be faster)
- 0.0998s Pure List Converted to Hash (not expected)
If the reason for this can be explained in simple terms, I'd love to hear it. I suppose at a minimum what I'm looking for is enough of an understanding to know if I should start with a List or a HashSet if the end result will be a large HashSet of items.
I've added the body of the prime method below, but note that all interaction with the data structure is identical (code-wise) between the two versions. I don't believe how I add data to the structure should affect the anomaly.
public static List<int> AllPrimesUnder(int upperLimit)
{
    List<int> primeList = new List<int>();
    primeList.Add(2);
    int testNumber = 3;
    bool isPrime;
    while (testNumber <= upperLimit)
    {
        isPrime = true;
        // Trial division by the primes found so far, in ascending order.
        foreach (int prime in primeList)
        {
            if (testNumber % prime == 0)
            {
                isPrime = false;
                break;
            }
            // No prime factor can exceed sqrt(testNumber), so stop early.
            if (testNumber < prime*prime)
                break;
        }
        if (isPrime)
            primeList.Add(testNumber);
        testNumber++;
    }
    return primeList;
}
Edit: By request I am adding the code for the Hash method. If it looks nearly identical, that's because it is.
public static HashSet<int> AllPrimesUnder_Hash(int upperLimit)
{
    HashSet<int> primeHash = new HashSet<int>();
    primeHash.Add(2);
    int testNumber = 3;
    bool isPrime;
    while (testNumber <= upperLimit)
    {
        isPrime = true;
        // Same trial division, but enumerating the HashSet (order not guaranteed).
        foreach (int prime in primeHash)
        {
            if (testNumber % prime == 0)
            {
                isPrime = false;
                break;
            }
            if (testNumber < prime*prime)
                break;
        }
        if (isPrime)
            primeHash.Add(testNumber);
        testNumber++;
    }
    return primeHash;
}
Also by request is the (ugly hackish) code I used to test the execution time:
Stopwatch stopWatch = new Stopwatch();
int iterations = 1;
HashSet<int> hashSet = new HashSet<int>();
List<int> list = new List<int>();

stopWatch.Restart();
for (int i = 0; i < iterations; i++)
{
    hashSet = Prime.AllPrimesUnder_Hash(1000000);
}
stopWatch.Stop();
Console.WriteLine("Hash: " + (stopWatch.Elapsed.TotalSeconds / iterations).ToString("#.###################"));

//////////////////////////

stopWatch.Restart();
for (int i = 0; i < iterations; i++)
{
    hashSet = new HashSet<int>(Prime.AllPrimesUnder(1000000));
}
stopWatch.Stop();
Console.WriteLine("List converted: " + (stopWatch.Elapsed.TotalSeconds / iterations).ToString("#.###################"));
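A slightly fairer variant of this harness would warm up the JIT and settle the garbage collector before each timed region. This is only a sketch reusing the same Prime methods; the small warm-up limit of 1000 is arbitrary:

Prime.AllPrimesUnder_Hash(1000);   // warm-up: JIT-compile both methods before timing
Prime.AllPrimesUnder(1000);

GC.Collect();                      // keep one run's garbage out of the next run's time
GC.WaitForPendingFinalizers();
GC.Collect();

stopWatch.Restart();
hashSet = Prime.AllPrimesUnder_Hash(1000000);
stopWatch.Stop();
Console.WriteLine("Hash (warmed): " + stopWatch.Elapsed.TotalSeconds.ToString("#.###################"));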
Answer 1:
In AllPrimesUnder you enumerate the prime list many times (once for each prime candidate). Enumerating a List is faster than enumerating a HashSet, because the internal array of the HashSet is more sparse.
Not having seen the code for AllPrimesUnder_Hash, I would guess that this is the main cause.
I'm not convinced that resizing a list of a few thousand items could consume 20ms. Copying memory using memcpy (which is what happens internally) is one of the highest-throughput operations you can do. You can copy tens of gigabytes per second per core.
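One way to test the enumeration claim in isolation is to build both structures up front and time nothing but the foreach. A minimal sketch; the class name, sizes, and checksum are illustrative, not from the question:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class EnumerationBenchmark
{
    static void Main()
    {
        // Same contents in both structures; only the enumerator differs.
        List<int> list = Enumerable.Range(0, 1000000).ToList();
        HashSet<int> hashSet = new HashSet<int>(list);

        long sum = 0;
        Stopwatch sw = Stopwatch.StartNew();
        foreach (int x in list)
            sum += x;              // List: sequential walk over a dense array
        sw.Stop();
        Console.WriteLine("List:    " + sw.Elapsed.TotalMilliseconds + " ms");

        sw.Restart();
        foreach (int x in hashSet)
            sum += x;              // HashSet: has to skip unused slots in its internal array
        sw.Stop();
        Console.WriteLine("HashSet: " + sw.Elapsed.TotalMilliseconds + " ms (checksum " + sum + ")");
    }
}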
Answer 2:
The reason is that when the HashSet is initialized with a collection, it can use the size of the collection to set the capacity. When adding values to an empty HashSet, the capacity needs to be increased from time to time, and that is an O(n) operation.
For some reason the HashSet does not take capacity as a parameter in the constructor like the List does.
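To see the resize cost on its own, compare adding one value at a time against constructing the set from an ICollection<int> whose Count is known up front. A rough sketch; the exact growth policy is an implementation detail:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class CapacityDemo
{
    static void Main()
    {
        int n = 1000000;
        List<int> source = Enumerable.Range(0, n).ToList();

        // One Add at a time: the set starts small and must rehash into a
        // larger internal array several times as it grows.
        Stopwatch sw = Stopwatch.StartNew();
        HashSet<int> grown = new HashSet<int>();
        for (int i = 0; i < n; i++)
            grown.Add(i);
        sw.Stop();
        Console.WriteLine("Incremental Adds:      " + sw.Elapsed.TotalMilliseconds + " ms");

        // Constructed from an ICollection<int>: the constructor can read
        // source.Count and size the internal array once, up front.
        sw.Restart();
        HashSet<int> fromCollection = new HashSet<int>(source);
        sw.Stop();
        Console.WriteLine("Built from collection: " + sw.Elapsed.TotalMilliseconds + " ms");

        Console.WriteLine("Same contents: " + grown.SetEquals(fromCollection));
    }
}

(Later versions, .NET Framework 4.7.2 and .NET Core 2.0 onward, added a HashSet<T>(int capacity) constructor, which removes the need for the List detour.)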
Answer 3:
Looking at your algorithm, I suspect that the pure hash is slower because it is a hash, not an ordered list. With an ordered list you test divisibility by 2, 3, 5, 7, and so on in order, so the smaller divisors (which are more often divisors) are tested first. With a hash the order is arbitrary, so you may test divisibility by 23 before you test divisibility by 3.
BTW, you should use testNumber += 2 and exclude 2 from your list of primes, inserting 2 once you are done with the loop.
Even better, the Sieve of Eratosthenes is usually a faster way to calculate all primes for relatively small limits. Better still, precompute your low-value primes and load them from disk.
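For reference, a basic Sieve of Eratosthenes along those lines might look like the sketch below. It is not the asker's code; it matches the inclusive upperLimit behavior of AllPrimesUnder, and the method name is mine:

using System.Collections.Generic;

public static List<int> SievePrimesUnder(int upperLimit)
{
    // composite[i] == true means i has been crossed off as a multiple
    // of some smaller prime.
    bool[] composite = new bool[upperLimit + 1];
    List<int> primes = new List<int>();
    for (int p = 2; p <= upperLimit; p++)
    {
        if (composite[p])
            continue;
        primes.Add(p);
        // Start crossing off at p*p; smaller multiples of p were already
        // handled by smaller primes. long avoids overflow when p*p exceeds int.MaxValue.
        for (long multiple = (long)p * p; multiple <= upperLimit; multiple += p)
            composite[multiple] = true;
    }
    return primes;
}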
EDIT -- ADDED
Not what I expected initially (the hash being out of order), but it looks like there is simply a good bit more overhead in MoveNext(), which is what foreach calls internally.
Compare the difference in the MoveNext() functions, which you will call millions of times in the innermost loop.
// HashSet<>.MoveNext()
public bool MoveNext()
{
    if (this.version != this.set.m_version)
    {
        throw new InvalidOperationException(SR.GetString("InvalidOperation_EnumFailedVersion"));
    }
    while (this.index < this.set.m_lastIndex)
    {
        if (this.set.m_slots[this.index].hashCode >= 0)
        {
            this.current = this.set.m_slots[this.index].value;
            this.index++;
            return true;
        }
        this.index++; // skip an unused slot
    }
    this.index = this.set.m_lastIndex + 1;
    this.current = default(T);
    return false;
}
// List<>.MoveNext()
public bool MoveNext()
{
    List<T> list = this.list;
    if ((this.version == list._version) && (this.index < list._size))
    {
        this.current = list._items[this.index];
        this.index++;
        return true;
    }
    return this.MoveNextRare(); // this call should be rare as the name implies
}

private bool MoveNextRare()
{
    if (this.version != this.list._version)
    {
        ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
    }
    this.index = this.list._size + 1;
    this.current = default(T);
    return false;
}
Source: https://stackoverflow.com/questions/18987841/in-c-sharp-why-is-it-faster-to-create-a-hashset-from-a-list-instead-of-starting