Efficient comparison of 100,000 vectors

礼貌的吻别 2021-01-31 21:20

I store 100,000 vectors in a database. Each vector has dimension 60 (int vector[60]).

Then I take one of them and want to present the vectors to the user in order of decreasing similarity.

10 Answers
  • 2021-01-31 21:59

    Update:

    After you made clear that 60 is the dimension of your space, not the length of the vectors, the answer below is not applicable to your case, so I'll keep it here just for history.


    Since your vectors are normalized, you can employ a kd-tree to find the neighbors within an MBH of incremental hypervolume.

    No database I'm aware of has native support for kd-trees, so if you are searching for a limited number of closest entries, you can try to implement the following solution in MySQL:

    • Store the projections of each vector onto every possible 2-dimensional coordinate plane (takes n * (n - 1) / 2 columns; see the sketch below)
    • Index each of these columns with a SPATIAL index
    • Pick a square MBR of a given area within any projection. The product of these MBRs gives you a hypercube of limited hypervolume, holding all vectors whose distance does not exceed a given value.
    • Find all projections within all MBRs using MBRContains

    You'll still need to sort within this limited range of values.
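
    As a rough illustration of the first step (sketched in C only because the question declares int vector[60]; the helper name is made up), the projection of a vector onto the coordinate plane (i, j) is simply the pair (v[i], v[j]), one value per p<i><j> column:

    #include <stdio.h>

    /* Illustrative sketch: print the n * (n - 1) / 2 two-dimensional projections
       of one vector, i.e. the values that would populate the p12, p13, ... columns. */
    void emit_projections(const int *v, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                printf("p%d%d = (%d, %d)\n", i + 1, j + 1, v[i], v[j]);
    }

    For the vector (0, 2, 0, 0) below, this prints p12 = (0, 2), p13 = (0, 0), p14 = (0, 0), p23 = (2, 0), p24 = (2, 0), p34 = (0, 0), matching the third row of the table.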

    For instance, say you have a set of 4-dimensional vectors, each with magnitude 2:

    (2, 0, 0, 0)
    (1, 1, 1, 1)
    (0, 2, 0, 0)
    (-2, 0, 0, 0)
    

    You'll have to store them as follows:

    p12  p13  p14  p23  p24  p34
    ---  ---  ---  ---  ---  ---
    2,0  2,0  2,0  0,0  0,0  0,0
    1,1  1,1  1,1  1,1  1,1  1,1
    0,2  0,0  0,0  2,0  2,0  0,0
    -2,0 -2,0 -2,0 0,0  0,0  0,0
    

    Say you want the similarity with the first vector, (2, 0, 0, 0), to be greater than 0.

    This means having the vectors inside the hypercube: (0, -2, -2, -2):(4, 2, 2, 2).

    You issue the following query:

    SELECT  *
    FROM    vectors
    WHERE   MBRContains(LineFromText('LINESTRING(0 -2, 4 2)'), p12)
            AND MBRContains(LineFromText('LINESTRING(0 -2, 4 2)'), p13)
            …
    

    and so on, for all six columns.

  • 2021-01-31 22:00

    If you're willing to live with approximations, there are a few ways you can avoid having to go through the whole database at runtime. In a background job you can start pre-computing pairwise distances between vectors. Doing this for the whole database is a huge computation, but it does not need to be finished to be useful (e.g. start by computing the distances to 100 or so random vectors for each vector and store the results in the database).

    Then triangulate: if the distance d between your target vector v and some vector v' is large, then the distance between v and all other vectors v'' that are close to v' will be large(-ish) too, so there is no need to compare them any more (you will have to find an acceptable definition of "large" yourself, though). You can experiment with repeating the process for the discarded vectors v'' as well, and test how much runtime computation you can avoid before the accuracy starts to drop (build a test set of "correct" results for comparison).
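
    A minimal sketch of that pruning test, assuming the background job exposes a lookup for the pairs it has already processed (precomputed_distance() and the threshold are made-up names, not from the answer); the bound used is the reverse triangle inequality d(v, v'') >= |d(v, v') - d(v', v'')|:

    #include <math.h>
    #include <stdbool.h>

    /* Hypothetical helper backed by the background job's results: returns the
       stored distance between vectors a and b, or a negative value if that
       pair has not been pre-computed yet. */
    extern double precomputed_distance(int a, int b);

    /* If the lower bound |d(v,v1) - d(v1,v2)| already exceeds the threshold,
       the exact 60-dimensional comparison between v and v2 can be skipped. */
    bool can_skip(double d_v_v1, int v1, int v2, double threshold)
    {
        double d_v1_v2 = precomputed_distance(v1, v2);
        if (d_v1_v2 < 0.0)
            return false;          /* unknown pair: must compare exactly */
        return fabs(d_v_v1 - d_v1_v2) > threshold;
    }

    Vectors rejected this way are guaranteed to be at least the threshold away from the target, so the inaccuracy comes only from how the threshold is chosen and from pairs that were never pre-computed.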

    Good luck.

    sds

  • 2021-01-31 22:00

    Uh, no?

    You only have to do all 99,999 against the one you picked (rather than all n(n-1)/2 possible pairs), of course, but that's as low as it goes.
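
    For concreteness, a minimal sketch of that single naive pass (assuming the 100,000 vectors fit in memory and that cosine similarity is the intended measure; the array and function names here are invented for illustration):

    #include <math.h>
    #include <stdlib.h>

    #define DIM         60
    #define NUM_VECTORS 100000

    static int    vectors[NUM_VECTORS][DIM];   /* assumed loaded from the database */
    static double score[NUM_VECTORS];

    static double cosine_similarity(const int *a, const int *b)
    {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < DIM; i++) {
            dot += (double)a[i] * b[i];
            na  += (double)a[i] * a[i];
            nb  += (double)b[i] * b[i];
        }
        return dot / (sqrt(na) * sqrt(nb));
    }

    static int by_decreasing_score(const void *x, const void *y)
    {
        double sx = score[*(const int *)x], sy = score[*(const int *)y];
        return (sx < sy) - (sx > sy);           /* sort high scores first */
    }

    /* Score every stored vector against vectors[query], then sort the indices
       in order[] by decreasing similarity: one linear scoring pass plus a sort. */
    void rank_by_similarity(int query, int order[NUM_VECTORS])
    {
        for (int i = 0; i < NUM_VECTORS; i++) {
            score[i] = cosine_similarity(vectors[query], vectors[i]);
            order[i] = i;
        }
        qsort(order, NUM_VECTORS, sizeof order[0], by_decreasing_score);
    }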


    Looking at your response to nsanders's answer, it is clear you are already on top of this part. But I've thought of a special case where computing the full set of comparisons might be a win. If:

    • the list comes in slowly (say you're getting the vectors from some data acquisition system at a fixed, low rate)
    • you don't know until the end which one you want to compare to
    • you have plenty of storage
    • you need the answer fast when you do pick one (and the naive approach isn't fast enough)
    • lookups are faster than computing

    then you could pre-compute the comparisons as the data comes in and just look up the result for each pair at sort time. This might also be effective if you will end up doing many sorts...
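
    A sketch of that special case, assuming the vectors trickle in one at a time and that "plenty of storage" really does cover the roughly 5 * 10^9 pairwise results for 100,000 vectors (tens of gigabytes); similarity(), MAX_VECTORS and the per-row allocation are illustrative, not something from the answer:

    #include <stdlib.h>

    #define DIM         60
    #define MAX_VECTORS 100000

    extern double similarity(const int *a, const int *b);  /* hypothetical metric */

    static int     vec[MAX_VECTORS][DIM];
    static double *pair_sim[MAX_VECTORS];   /* pair_sim[k][i] = similarity of k and i, i < k */
    static int     count = 0;

    /* Each arriving vector is compared against everything stored so far, so the
       quadratic work is spread over the slow acquisition period. */
    void on_vector_arrival(const int v[DIM])
    {
        int k = count++;
        for (int d = 0; d < DIM; d++)
            vec[k][d] = v[d];
        pair_sim[k] = malloc((size_t)k * sizeof(double));
        for (int i = 0; i < k; i++)
            pair_sim[k][i] = similarity(vec[k], vec[i]);
    }

    /* At sort time, the similarity of any pair is a lookup rather than a
       60-dimensional computation. */
    double lookup_similarity(int a, int b)
    {
        if (a == b)
            return 1.0;                  /* assuming a normalized measure */
        return (a > b) ? pair_sim[a][b] : pair_sim[b][a];
    }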

  • 2021-01-31 22:02

    Without going through all entries? It seems not possible. The only thing you can do is to do the math at insert time (remembering the equivalence T(A, B) = T(B, A) :P).

    This saves your query from checking the chosen vector against all the others at execution time (but it could heavily increase the space needed by the database).
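
    As an illustration of how that T(A, B) = T(B, A) symmetry halves the precomputed table, each unordered pair {a, b} with a != b maps to exactly one slot of a flat upper-triangular array (the helper name is hypothetical):

    #include <stddef.h>

    /* Canonical slot for the unordered pair {a, b}, a != b. For 100,000 vectors
       this gives 100000 * 99999 / 2 entries instead of a full 100000 x 100000
       matrix; the diagonal needs no entry at all. */
    size_t pair_index(size_t a, size_t b)
    {
        if (a < b) { size_t t = a; a = b; b = t; }   /* ensure a > b */
        return a * (a - 1) / 2 + b;
    }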
