In C# why is it faster to Create a HashSet from a List, instead of starting with a HashSet?

问题

I have a method that takes an upper limit, and returns a list of primes numbers up to that limit.

    public static List<int> AllPrimesUnder(int upperLimit)

I later decided that I really just needed to do lookups on the list, often just asking the question "Is This Prime". Since I was dealing with all primes under values like a million, I realized that HashSet was the structure I should be using. Certainly the lookup using the result of the method was faster, but the method its self was slower.

I believe the reason it's slower is because HashSet checks for duplicates before adding, while a List just shoves it on the end. What surprised me, and what spawned the question and title, is why starting with a List and using it to create HashSet, like so:

    hashSet = new HashSet<int>(Prime.AllPrimesUnder(1000000));

is faster than using a Hashset internal to the method, enabling a call like this:

    hashSet = Prime.AllPrimesUnder_Hash(1000000);

If the slowdown is in the duplicate checking, it should have to do the same amount of checking no matter what, right? This is likely where my understanding is failing me.

Here are the times I'm getting for primes under one million.

0.1136s Pure Hash
0.0975s Pure List (expected to be faster)
0.0998s Pure List Converted to Hash (not expected)

If the reason for this can be explained in simple terms, I'd love to hear it. I suppose at a minimum what I'm looking for is enough of an understanding to know if I should start with a List or a HashSet if the end result will be a large HashSet of items.

I've added the body of the prime method below, but note that all interaction with the data structure is identical (code wise) between the two. I don't believe how I add data to the structure should effect the anomaly.

    public static List<int> AllPrimesUnder(int upperLimit)
    {
        List<int> primeList = new List<int>();
        primeList.Add(2);
        int testNumber = 3;
        bool isPrime;

        while (testNumber <= upperLimit)
        {
            isPrime = true;

            foreach (int prime in primeList)
            {
                if (testNumber % prime == 0)
                {
                    isPrime = false;
                    break;
                }
                if (testNumber < prime*prime)
                    break;
            }

            if (isPrime)
                primeList.Add(testNumber);

            testNumber++;
        }

        return primeList;
    }

Edit: By request I am adding the code for the Hash method. If it looks nearly identical that's because it is.

public static HashSet<int> AllPrimesUnder_Hash(int upperLimit)
{
    HashSet<int> primeHash = new HashSet<int>();
    primeHash.Add(2);
    int testNumber = 3;
    bool isPrime;

    while (testNumber <= upperLimit)
    {
        isPrime = true;

        foreach (int prime in primeHash)
        {
            if (testNumber % prime == 0)
            {
                isPrime = false;
                break;
            }
            if (testNumber < prime*prime)
                break;
        }

        if (isPrime)
            primeHash.Add(testNumber);

        testNumber++;
    }

    return primeList;
}

Also by request is the (ugly hackish) code I used to test the execution time:

        Stopwatch stopWatch = new Stopwatch();
        int iterations = 1;
        HashSet<int> hashSet = new HashSet<int>();
        List<int> list = new List<int>();

        stopWatch.Restart();
        for (int i = 0; i < iterations; i++)
        {
            hashSet = Prime.AllPrimesUnder_Hash(1000000);
        }
        stopWatch.Stop();

        Console.WriteLine("Hash: " + (stopWatch.Elapsed.TotalSeconds / iterations).ToString("#.###################"));

//////////////////////////

        stopWatch.Restart();
        for (int i = 0; i < iterations; i++)
        {
            hashSet = new HashSet<int>(Prime.AllPrimesUnder(1000000));
        }
        stopWatch.Stop();


        Console.WriteLine("List converted: " + (stopWatch.Elapsed.TotalSeconds / iterations).ToString("#.###################"));

回答1:

In AllPrimesUnder you are enumerating the prime list many times (once for each prime candidate). Enumerating a List is faster than enumerating a HashSet because the internal array of the HashSet is more sparse.

Not seeing the code for AllPrimesUnder_Hash I guess that this is the main cause.

I'm not convinced that resizing a list of a few thousand items could consume 20ms. Copying memory using memcpy (which is what happens internally) is one of the highest-throughput operations you can do. You can copy tens of gigabytes per second per core.

回答2:

The reason is that when the HashSet is initialized with a collection it can use the size of the collection to set the capacity. When adding values to an empty HashSet the capacity needs to be increased from time to time and that is an O(n) operation.
For some reason the HashSet does not take capacity as a parameter in the constructor like the List does.

回答3:

Looking at your algorithm I suspect that the pure hash is slower because it is a hash, not an ordered list. When using an ordered list, you test divisibility against 2, 3, 5, 7, etc. in order, so the smaller divisors (which are more commonly divisors) are tested first. When using a hash, the order is arbitrary, so you may test for divisible by 23 before you test for divisible by 3.

BTW, you should use testnumber += 2, and exclude 2 from your list of primes, inserting 2 when you are done with your loop.

Even better, Sieve of Eratosthenes is usually a faster way to calculate all primes for relatively small numbers. Or even better, precompute your low value primes and load it from disk

EDIT -- ADDED

Not what I expected initially (a hash being out of order), but it just looks like a good bit more overhead in the MoveNext() -- which is how foreach works internally

Compare the difference in the MoveNext() functions -- which you will call millions of times in the innermost loop.

// HashSet<>.MoveNext()
public bool MoveNext()
{
    if (this.version != this.set.m_version)
    {
        throw new InvalidOperationException(SR.GetString("InvalidOperation_EnumFailedVersion"));
    }
    while (this.index < this.set.m_lastIndex)
    {
        if (this.set.m_slots[this.index].hashCode >= 0)
        {
            this.current = this.set.m_slots[this.index].value;
            this.index++;
            return true;
        }
        this.index++;
    }
    this.index = this.set.m_lastIndex + 1;
    this.current = default(T);
    return false;
}


List<>.MoveNext()
public bool MoveNext()
{
    List<T> list = this.list;
    if ((this.version == list._version) && (this.index < list._size))
    {
        this.current = list._items[this.index];
        this.index++;
        return true;
    }
    return this.MoveNextRare(); // this call should be rare as the name implies
}

private bool MoveNextRare()
{
    if (this.version != this.list._version)
    {
        ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
    }
    this.index = this.list._size + 1;
    this.current = default(T);
    return false;
}

来源：https://stackoverflow.com/questions/18987841/in-c-sharp-why-is-it-faster-to-create-a-hashset-from-a-list-instead-of-starting

标签

.net

hashset