I store 100,000 vectors in a database. Each vector has dimension 60 (int vector[60]).
Then I pick one of them and want to present the remaining vectors to the user in order of decreasing similarity to it.
Update:
After you made it clear that 60 is the dimension of your space, not the length of the vectors, the answer below is not applicable to you, so I'll keep it just for history.
Since your vectors are normalized, you can employ a kd-tree to find the neighbors within an MBH (minimum bounding hyperrectangle) of incremental hypervolume.
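A minimal sketch of that kd-tree lookup, assuming the vectors fit in memory; SciPy and NumPy are my choice here, the answer only names the data structure:

import numpy as np
from scipy.spatial import cKDTree

# Sketch: 100,000 random 60-dimensional vectors standing in for the database rows.
rng = np.random.default_rng(0)
vectors = rng.integers(-100, 100, size=(100_000, 60)).astype(float)

tree = cKDTree(vectors)              # build once, reuse for every query
query = vectors[0]                   # the vector the user picked
dist, idx = tree.query(query, k=50)  # 50 nearest neighbours by Euclidean distance
# idx is already ordered by increasing distance, i.e. decreasing similarity

Keep in mind that in 60 dimensions a kd-tree tends to degrade towards a linear scan, so treat this as a baseline rather than a guaranteed speed-up.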
No database I'm aware of has native support for kd-trees, so you can try to implement the following solution in MySQL, if you are searching for a limited number of closest entries:
- Map each vector onto every 2-dimensional space possible (takes n * (n - 1) / 2 columns)
- Create a SPATIAL index over each of these columns
- Select the MBR of a given area within any projection. The product of these MBR's will give you a hypercube of a limited hypervolume, which will hold all vectors with a distance not greater than a given one.
- Filter these MBR's using MBRContains
You'll still need to sort within this limited range of values.
For instance, you have a set of 4-dimensional vectors with a magnitude of 2:
(2, 0, 0, 0)
(1, 1, 1, 1)
(0, 2, 0, 0)
(-2, 0, 0, 0)
You'll have to store them as follows:
p12 p13 p14 p23 p24 p34
--- --- --- --- --- ---
2,0 2,0 2,0 0,0 0,0 0,0
1,1 1,1 1,1 1,1 1,1 1,1
0,2 0,0 0,0 2,0 2,0 0,0
-2,0 -2,0 -2,0 0,0 0,0 0,0
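A quick Python sketch (my own illustration, not part of the answer) of how those pXY columns are derived from a vector:

from itertools import combinations

def projections(vector):
    """Return {(i, j): (vector[i], vector[j])} for every index pair i < j."""
    return {(i, j): (vector[i], vector[j])
            for i, j in combinations(range(len(vector)), 2)}

print(projections((2, 0, 0, 0)))
# {(0, 1): (2, 0), (0, 2): (2, 0), (0, 3): (2, 0),
#  (1, 2): (0, 0), (1, 3): (0, 0), (2, 3): (0, 0)}  -> the first row of the table

For dimension 60 that means 60 * 59 / 2 = 1770 point columns per row.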
Say, you want similarity with the first vector (2, 0, 0, 0) greater than 0. This means having the vectors inside the hypercube (0, -2, -2, -2):(4, 2, 2, 2).
You issue the following query:
SELECT *
FROM vectors
WHERE MBRContains(LineFromText('LINESTRING(0 -2, 4 2)'), p12)
  AND MBRContains(LineFromText('LINESTRING(0 -2, 4 2)'), p13)
  …
and so on, for all six columns.
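To see how this extends beyond the six columns of the toy example, here is a rough Python sketch (my own; the pXY column naming is just the convention from the example above) that generates one MBRContains condition per projection from the hypercube corners:

from itertools import combinations

def mbr_conditions(lo, hi):
    """One rectangle test per 2-D projection column, given the hypercube corners."""
    conds = []
    for i, j in combinations(range(len(lo)), 2):
        rect = f"LineFromText('LINESTRING({lo[i]} {lo[j]}, {hi[i]} {hi[j]})')"
        conds.append(f"MBRContains({rect}, p{i + 1}{j + 1})")
    return conds

print(" AND\n".join(mbr_conditions(lo=(0, -2, -2, -2), hi=(4, 2, 2, 2))))
# For the 60-dimensional case this would emit 60 * 59 / 2 = 1770 conditions.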
If you're willing to live with approximations, there are a few ways you can avoid having to go through the whole database at runtime. In a background job you can start pre-computing pairwise distances between vectors. Doing this for the whole database is a huge computation, but it does not need to be finished to be useful (e.g. start by computing distances to 100 or so random vectors for each vector, and store the results in the database).
Then triangulate. If the distance d between your target vector v and some vector v' is large, then the distance between v and all other v'' that are close to v' will be large(-ish) too, so there is no need to compare them anymore (you will have to find acceptable definitions of "large" yourself, though). You can experiment with repeating the process for the discarded vectors v'' too, and test how much runtime computation you can avoid before the accuracy starts to drop (make a test set of "correct" results for comparison).
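A rough Python sketch of that pruning step, assuming a background job has already filled a neighbours map with some pairwise distances (the names prune_and_rank, neighbours and threshold are mine, not from the answer):

import numpy as np

def prune_and_rank(query, vectors, neighbours, threshold):
    """neighbours[i]: list of (j, dist_ij) pairs pre-computed in the background.
    threshold: vectors farther than this from the query are not of interest."""
    skip = set()
    results = []
    for i, v in enumerate(vectors):
        if i in skip:
            continue
        d = np.linalg.norm(query - v)            # the expensive comparison
        if d <= threshold:
            results.append((d, i))
        else:
            # i is far from the query, so anything known to be close to i is
            # also far: d(query, j) >= d - dist_ij > threshold  ->  discard j.
            for j, dist_ij in neighbours.get(i, ()):
                if d - dist_ij > threshold:
                    skip.add(j)
    return sorted(results)                       # (distance, index), closest first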
good luck.
Uh, no?
You only have to do all 99,999 against the one you picked (rather than all n(n-1)/2 possible pairs), of course, but that's as low as it goes.
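At this scale the linear pass is cheap; a minimal NumPy sketch (my own, using cosine similarity since the question doesn't name a measure):

import numpy as np

def rank_by_similarity(picked, vectors):
    """Indices of `vectors` ordered by decreasing cosine similarity to `picked`."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(picked)
    sims = vectors @ picked / np.where(norms == 0, 1, norms)  # guard divide-by-zero
    return np.argsort(-sims)                                  # descending similarity

# usage: order = rank_by_similarity(vectors[picked_id], vectors)
# (order[0] will be the picked vector itself if it is included in `vectors`)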
Looking at your response to nsanders's answer, it is clear you are already on top of this part. But I've thought of a special case where computing the full set of comparisons might be a win. If:
then you could pre-compute the comparisons as the data comes in and just look up the results per pair at sort time. This might also be effective if you end up doing many sorts...
Without going through all entries? It seems not possible. The only thing you can do is to do the math at insert time (remembering the equivalence T(A, B) = T(B, A) :P ).
This saves your query from checking the list against all the other lists at execution time (but it could heavily increase the space needed for the DB).
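A small sketch of that insert-time bookkeeping (names are mine; the dot product stands in for whatever measure T you actually use, stored once per unordered pair because T(A, B) = T(B, A)):

import numpy as np

class SimilarityCache:
    """Pre-computes similarities at insert time; sorting later is lookups only."""
    def __init__(self):
        self.vectors = {}   # id -> vector
        self.sim = {}       # (min_id, max_id) -> similarity, stored once per pair

    def insert(self, vid, vector):
        vector = np.asarray(vector, dtype=float)
        for other_id, other in self.vectors.items():
            key = (min(vid, other_id), max(vid, other_id))
            self.sim[key] = float(vector @ other)   # example measure: dot product
        self.vectors[vid] = vector

    def ranked(self, vid):
        """All other ids ordered by decreasing similarity to `vid`."""
        pairs = ((self.sim[(min(vid, o), max(vid, o))], o)
                 for o in self.vectors if o != vid)
        return [o for _, o in sorted(pairs, reverse=True)]

For 100,000 vectors that is about 5 * 10^9 stored pairs, which is exactly the space blow-up mentioned above.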