Question
I have an application where, given a reasonably large set of images (say 20K) and a query image, I want to find the most similar one. A reasonable approximation is acceptable.
To represent each image with enough precision, I'm using SIFT (a parallel version, to achieve fast computation as well).
Now, given the set of n SIFT descriptors (where usually 500 < n < 1000, depending on image size), which can be represented as an n × 128 matrix, from what I've seen in the literature there are two possible approaches for my case:
- Descriptor matching: we map each descriptor vector to a low-dimensional space and try to find an approximation of its most similar counterpart, for example through LSH. Then we increment the number of matches between the query image and the image that the matched descriptor belongs to. We iterate this process over all the descriptors. Finally, we return as the result the image with the highest number of descriptor matches (a minimal sketch of this voting scheme follows the list).
- Bag of Features: we create a histogram vector for each image following the BoF model. Supposing that we use k-means (where k = 128, for example), we obtain a k-dimensional vector for each image. Since k could be too large for efficient comparison, we can map it to a smaller (possibly binary) space through LSH again (as we did in approach 1). Finally, we return as the result the image with the most similar histogram (see the second sketch below). Notice that a big problem with this approach is that, as I discussed in this question, in order to quickly build the histogram we need to use LSH yet again (what a mess!).
I'm surprised that I couldn't find any comparison of these two approaches. My question is: what do we have to consider for each of them? Is there any research comparing these two approaches? The first method seems more efficient, and it looks feasible for a dataset of this size.
Source: https://stackoverflow.com/questions/37987863/similar-images-bag-of-features-visual-word-or-matching-descriptors