Good way to hash a float vector?

问题

I am well aware of all the problems involved in comparing floats. This is exactly the reason for this question.
I'm looking to create a fast hash table for values that are 3D vectors (3 floats - x,y,z). It can be assumed that the length of the vector is always 1.0 (sqrt(x*x+y*y+z*z) is 1.0)

Essentially this means that I'm looking for a hash function that takes values that are almost equal to the same unsigned int value and a corresponding equality operator that is true if the hash values are equal (not not necessarily only if they are equal)

Edit -
False positives (i.e. vectors that are different but map to the same bucket) are a given since this is a hash table.
False negatives (i.e. Vectors that are close but map to different buckets) are undesirable but it seems that there is no way to avoid them. In my case, they will not causes total breakage, just some data duplication which is something I'll have to live with.

回答1:

I think what you're looking for is not directly possible. One important property of equality is that it is transitive. (ie. If a == b and b == c, then a == c). With a distance measure, though, you really don't want this property. Example:

Take a single float (for simplicity). Suppose we want to hash each float so that floats less than 1e-3 away are the same value. Now, suppose we add to this hash table 1000 float values all 1e-4 apart. Any neighboring 2 values should hash to the same float, since they are closer than 1e-3. However, because of transitivity, the neighbors of those values should also has to the same value, and their neighbors and so on. As a result, all 1000 values, including pairs farther than 1e-3 apart, would hash to the same integer. If you were to draw these points on a picture:

A  B  C  D  E  F  G  H ... Y Z

Suppose all the gaps are < 1e-3 apart, but A and Z are > 1e-3 apart (not to scale!). This can't be satisfied because if hash(A) == hash(B) and hash(B) == hash(C) and so on for all pairs, (since they are < 1e-3 apart) than hash(A) must == hash(Z).

One possible option is to define regions of your vector space in which all vectors would hash to the same value (ie round them before hashing them), but you could still get 2 vectors on the edges of their respective spaces that are close together but hash to a different value. You could get around that by searching all neighboring spaces for a vector. (ie in the 1-d case above, you'd round all vectors to the nearest multiple of 1e-3, and then search the neighbors, so 5.3e-3 would search 5e-3, 4e-3 and 6-e3. In higher-dimensional cases, you'd have to search neighbors in all dimensions.)

回答2:

Some languages (C, Java 5) allow you to access the binary value of your floats. This way, you can extract the first N bits of the mantissa (ignoring the last few bits which cause the trouble during compare) and calculate the hash from that.

回答3:

I'd convert the float values into integers like this:

unsigned int IntValue = (int)(floatValue * MULT) + MULT;

so you get some of the first digits and then use

const MULT1 = (MULT << 1) + 1;
unsigned long long HashValue = (xIntValue * MULT1  * MULT1) + (yIntValue * MULT1) + zIntValue;

as a hash value (using (MULT * 2) + 1 because the IntValues will be between 0 and MULT * 2 inclusive).

The memory needed will be depending on the multiplicator MULT. For example, using 32 you'll get a hashtable using 64 * 64 * 64 * (Hash item size) = 262144 * (Hash item size) bytes.

回答4:

Can you ellaborate on your problem?

Assuming that you are using a hashmap to map some additional data to specific vectors, you could just use the XOR of the binary representations of the components (if this is possible in your language of choice). Then use as many LSBs (to reduce collisions) as you need for the hash map. This would of course have the property that two equal (by floating point comparison) vectors might not have the same hash (e.g. IEEE floating point 0 equals to -0, but they have different sign bit).

However, if you are planning on using vectors that are results of different computations to do hash lookup, you are setting yourself up to the possibility of not having matching hash codes due to rounding errors and you should probably be using something else anyway.

回答5:

I think you are effectively trying to solving the K nearest problem. I believe what you are looking for is locality sensitive hashing. Also you can use quad tree structures to achieve the same result.

回答6:

don't know how fast this might be, but since you have unit vectors, they all lie on the surface of a sphere. convert to a http://en.wikipedia.org/wiki/Spherical_coordinate_system. then use phi and theta to choose a bucket. there will be no false positives. you can look in the neighboring cells for false negatives.

回答7:

Do you need it to be a fast hash table or would a tree structure do?

It seems to me that it would be easier to look up matching floats in a search tree of some sort. A B-Tree minimizes the number of cache misses, assuming that you choose the right node size. That ought to make it pretty fast in practice.

来源：https://stackoverflow.com/questions/650175/good-way-to-hash-a-float-vector

标签

floating-point

hashtable

hash