Say I have an object that stores a byte array and I want to be able to efficiently generate a hash code for it. I've used the cryptographic hash functions for this in the past.
Have you compared with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it's pretty well optimized. I used it in an Identicon Handler that performed pretty well under load.
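Something like this minimal sketch shows the idea (ByteArrayWrapper and its field are just placeholder names, not from the question): compute the SHA1 digest of the stored array and fold the first four bytes of it into an int.

using System;
using System.Security.Cryptography;

public class ByteArrayWrapper
{
    private readonly byte[] data;

    public ByteArrayWrapper(byte[] data)
    {
        this.data = data;
    }

    public override int GetHashCode()
    {
        // Hash the whole array with SHA1 and use the first four bytes of the
        // 20-byte digest as the int hash code.
        using (var sha1 = new SHA1CryptoServiceProvider())
        {
            byte[] digest = sha1.ComputeHash(data);
            return BitConverter.ToInt32(digest, 0);
        }
    }
}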
Whether you want a perfect hash function (a distinct value for every object that doesn't compare equal) or just a pretty good one is always a performance tradeoff: computing a good hash function normally takes time, and if your dataset is smallish you're better off with a fast function. The most important thing (as your second post points out) is correctness, and to achieve that all you need is to return the Length of the array. Depending on your dataset that might even be OK. If it isn't (say all your arrays are equally long), you can go with something cheap like looking at the first and last values and XORing them, then add more complexity as you see fit for your data.
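For example, a cheap hash along those lines might look like this (assuming a byte[] field named Val, as in the MyHash class further down this page):

public override int GetHashCode()
{
    // Cheapest option: the length alone. If lengths are often equal, mix in
    // the first and last bytes so equal-length arrays still spread out.
    if (Val == null || Val.Length == 0)
        return 0;

    return Val.Length ^ (Val[0] << 8) ^ (Val[Val.Length - 1] << 16);
}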
A quick way to see how your hash function performs on your data is to add all the data to a hashtable and count the number of times the Equals method gets called; if it is called too often, you have more work to do on the function. If you do this, keep in mind that the hashtable's size needs to be set bigger than your dataset when you start, otherwise you are going to rehash the data, which triggers reinserts and more Equals evaluations (though possibly more realistic?).
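A rough sketch of that measurement (CountingKey and its static counter are just illustrative names; plug the hash function you want to test into GetHashCode):

using System;

public class CountingKey : IEquatable<CountingKey>
{
    // Static counter used only to see how often the hashtable falls back to Equals.
    public static int EqualsCalls;

    private readonly byte[] val;

    public CountingKey(byte[] val) { this.val = val; }

    // The hash function under test - here just the array length.
    public override int GetHashCode() => val.Length;

    public bool Equals(CountingKey other)
    {
        EqualsCalls++;
        if (other == null || other.val.Length != val.Length) return false;
        for (int i = 0; i < val.Length; i++)
            if (other.val[i] != val[i]) return false;
        return true;
    }

    public override bool Equals(object obj) => Equals(obj as CountingKey);
}

// Usage sketch: pre-size the dictionary larger than the data set so it never
// rehashes during the test, then inspect CountingKey.EqualsCalls afterwards:
//     var table = new Dictionary<CountingKey, bool>(500000);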
For some objects (not this one) a quick hash code can be generated by ToString().GetHashCode() - certainly not optimal, but useful, as people tend to return something close to the identity of the object from ToString(), and that is exactly what GetHashCode is looking for.
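For example (Money here is just a made-up class to illustrate the idea):

public class Money
{
    public string Currency { get; set; }
    public decimal Amount { get; set; }

    public override string ToString() => Currency + " " + Amount;

    // Piggy-back on the string's hash: not optimal, but ToString() returns
    // something close to the identity of the object, which is what GetHashCode wants.
    public override int GetHashCode() => ToString().GetHashCode();
}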
Trivia: the worst performance I have ever seen was when someone by mistake returned a constant from GetHashCode. It's easy to spot with a debugger, though, especially if you do lots of lookups in your hashtable.
RuntimeHelpers.GetHashCode might help:
From MSDN:
Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures such as a hash table.
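A quick sketch of using it (note that it hashes by reference identity, so two arrays with identical contents still get different hash codes):

using System;
using System.Runtime.CompilerServices;

byte[] a = { 1, 2, 3 };
byte[] b = { 1, 2, 3 };

// Reference-based hash codes: these two lines will almost certainly print different values.
Console.WriteLine(RuntimeHelpers.GetHashCode(a));
Console.WriteLine(RuntimeHelpers.GetHashCode(b));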
I found interesting results:
I have the class:
public class MyHash : IEquatable<MyHash>
{
    public byte[] Val { get; private set; }

    public MyHash(byte[] val)
    {
        Val = val;
    }

    /// <summary>
    /// Tests whether this instance equals another MyHash by comparing the byte arrays element by element.
    /// </summary>
    /// <param name="other"></param>
    /// <returns></returns>
    public bool Equals(MyHash other)
    {
        if (other.Val.Length == this.Val.Length)
        {
            for (var i = 0; i < this.Val.Length; i++)
            {
                if (other.Val[i] != this.Val[i])
                {
                    return false;
                }
            }
            return true;
        }
        else
        {
            return false;
        }
    }

    public override int GetHashCode()
    {
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();
    }
}
Then I created a dictionary with keys of type MyHash in order to test how fast I could insert items and how many collisions there would be. I did the following:
// dictionary we use to check for collisions
Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();

// used to generate random arrays
Random rand = new Random();

var now = DateTime.Now;

for (var j = 0; j < 100; j++)
{
    for (var i = 0; i < 5000; i++)
    {
        // create a new array and populate it with random bytes
        byte[] randBytes = new byte[byte.MaxValue];
        rand.NextBytes(randBytes);

        MyHash h = new MyHash(randBytes);

        if (checkForDuplicatesDic.ContainsKey(h))
        {
            Console.WriteLine("Duplicate");
        }
        else
        {
            checkForDuplicatesDic[h] = true;
        }
    }

    Console.WriteLine(j);
    checkForDuplicatesDic.Clear(); // clear the dictionary after each batch of 5000 insertions
}

var elapsed = DateTime.Now - now;
Console.WriteLine(elapsed);

Console.Read();
Every time I insert a new item, the dictionary calculates the hash of that object, so you can tell which method is the most efficient by dropping the various answers found here into the method public override int GetHashCode().
The method that was by far the fastest and had the fewest collisions was:
public override int GetHashCode()
{
    var str = Convert.ToBase64String(Val);
    return str.GetHashCode();
}
that took 2 seconds to execute. The method
public override int GetHashCode()
{
    // 7.1 seconds
    unchecked
    {
        const int p = 16777619;
        int hash = (int)2166136261;

        for (int i = 0; i < Val.Length; i++)
            hash = (hash ^ Val[i]) * p;

        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        return hash;
    }
}
also produced no collisions, but it took 7 seconds to execute!
The hash code of an object does not need to be unique.
The checking rule is: if two hash codes are equal, the (slower) Equals method is called to confirm the match; if the hash codes differ, the objects cannot be equal and Equals is skipped.
All you want is a GetHashCode algorithm that splits your collection up into roughly even groups - it shouldn't form the key, as the HashTable or Dictionary<> will need to use the hash to optimise retrieval.
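In code, the rule looks roughly like this (CouldBeEqual is just an illustrative helper, not anything the framework provides):

static bool CouldBeEqual(object x, object y)
{
    if (x.GetHashCode() != y.GetHashCode())
        return false;       // different hash codes: definitely not equal, Equals is skipped
    return x.Equals(y);     // same hash code: confirm with the full (slower) Equals check
}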
How long do you expect the data to be? How random? If lengths vary greatly (say, for files) then just return the length. If lengths are likely to be similar, look at a subset of the bytes that varies.
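A sketch of that idea, again assuming a byte[] field named Val:

public override int GetHashCode()
{
    // Combine the length with a small, fixed sample of bytes spread across the
    // array, so similar-length arrays still land in different buckets.
    if (Val == null || Val.Length == 0)
        return 0;

    unchecked
    {
        int hash = Val.Length;
        int step = Math.Max(1, Val.Length / 16);   // sample at most ~16 positions
        for (int i = 0; i < Val.Length; i += step)
            hash = hash * 31 + Val[i];
        return hash;
    }
}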
GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.
Two equal objects must never have different hash codes. Two different objects should not share the same hash code, but some collisions are to be expected (after all, there are far more possible byte arrays than 32-bit integers).