Question
I'm reading a paper in which the authors mention that they were able to find the single nearest neighbor in O(1) using a prefix tree. I will describe the general problem, then the classical solution, and finally the solution proposed in the paper:
Problem: given a list L of bit vectors (all of the same length) and a query bit vector q, we would like to find the nearest neighbor of q. The distance metric is the Hamming distance (the number of bits that differ). The naive approach is to go through the list and compute the Hamming distance between each vector in the list and q, which takes O(N). However, given that we will have millions of bit vectors, that is very expensive, so we would like to reduce it.
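To make the setup concrete, here is a minimal sketch of the naive O(N) scan in Python, assuming the bit vectors are stored as equal-width Python integers (the helper names are mine, not from the paper):

```python
def hamming(a: int, b: int) -> int:
    # Hamming distance = number of differing bits = popcount of the XOR.
    return bin(a ^ b).count("1")

def nearest_naive(L: list[int], q: int) -> int:
    # O(N) scan: compare q against every vector in the list.
    return min(L, key=lambda v: hamming(v, q))

# Example usage (hypothetical data):
# L = [0b1010, 0b0111, 0b0001]
# print(bin(nearest_naive(L, 0b1011)))   # -> 0b1010 (distance 1)
```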
Classical solution: the classical solution is to find an approximate nearest neighbor in O(log N). The idea is to first sort L lexicographically so that similar bit vectors end up close to each other. Given q, we binary-search the sorted list for the position where q would fall, take the vectors just above and below that position (they are similar because of the sorting), compute the Hamming distance to each, and pick the one with the lowest distance. A single sorted list will still miss many similar vectors, so to cover as many of them as possible we use P lists and P jumbling (bit-permutation) functions, one per list. Each bit vector is inserted into every list after its bits have been jumbled by that list's function, so we end up with P lists, each containing the vectors with their bits in a different order, and each list is again sorted lexicographically. Given q, we run the same binary search on each list, but first apply that list's jumbling function to q. This gives P candidate vectors, from which we pick the one most similar to q. This way we cover as many similar vectors as we can. Ignoring the time required for sorting, the time needed to locate the nearest neighbor is O(log N), the cost of the binary search on each list.
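Here is a hedged sketch of that classical scheme: P random bit permutations as the "jumbling functions", P lexicographically sorted lists, and a binary-search probe into each list. The names, the permutation choice, and the width of the candidate window are my own assumptions, not taken from the paper:

```python
import bisect
import random

M = 16       # bit-vector length (assumed fixed)
P = 4        # number of permuted lists
WINDOW = 2   # how many candidates to examine above/below the probe position

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def make_permutation(rng: random.Random) -> list[int]:
    perm = list(range(M))
    rng.shuffle(perm)
    return perm

def jumble(v: int, perm: list[int]) -> int:
    # Reorder the bits of v according to perm.
    out = 0
    for i, src in enumerate(perm):
        out |= ((v >> src) & 1) << i
    return out

def build_index(L: list[int], seed: int = 0):
    rng = random.Random(seed)
    perms = [make_permutation(rng) for _ in range(P)]
    # Each list stores (jumbled, original) pairs, sorted lexicographically
    # by the jumbled value.
    lists = [sorted((jumble(v, perm), v) for v in L) for perm in perms]
    return perms, lists

def query(perms, lists, q: int) -> int:
    best, best_d = None, M + 1
    for perm, lst in zip(perms, lists):
        jq = jumble(q, perm)
        # Binary search for where the jumbled query would land.
        pos = bisect.bisect_left(lst, (jq, -1))
        lo, hi = max(0, pos - WINDOW), min(len(lst), pos + WINDOW)
        for _, orig in lst[lo:hi]:
            d = hamming(orig, q)   # Hamming distance is permutation-invariant
            if d < best_d:
                best_d, best = d, orig
    return best

# Example usage (hypothetical data):
# perms, lists = build_index([0b1010101010101010, 0b1111000011110000, 0b1111])
# print(bin(query(perms, lists, 0b1010101010101011)))
```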
Proposed solution: the solution proposed in the paper (without any explanation) claims that we can get the nearest neighbor in O(1) time using prefix trees. The paper says they use P prefix trees and P jumbling functions, one per tree. Each bit vector is inserted into every tree after its bits have been jumbled by that tree's function. Given q, we apply the jumbling function corresponding to each tree to q and retrieve the most similar vector from that tree, ending up with P bit vectors retrieved from the trees. The paper says that retrieving the most similar vector to q from a prefix tree is O(1). I really don't understand this at all; as far as I know, searching a prefix tree is O(M), where M is the length of the bit vector. Does anybody understand why it is O(1)?
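For reference, here is how I picture one of those prefix trees: a binary trie with a greedy, one-level-at-a-time descent. This is my own reading of the description, not code from the paper; the lookup visits exactly M levels, so it is O(M) in the bit length and only O(1) if M is treated as a constant:

```python
M = 16  # fixed bit-vector length (assumption)

class TrieNode:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.value = None             # original vector stored at a leaf

def insert(root: TrieNode, v: int) -> None:
    node = root
    for i in range(M - 1, -1, -1):    # most significant bit first
        bit = (v >> i) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.value = v

def nearest_greedy(root: TrieNode, q: int) -> int:
    # Greedy descent: follow q's bit when that branch exists, otherwise
    # take the other branch. Returns an approximate nearest neighbor in
    # exactly M steps. Assumes at least one vector was inserted.
    node = root
    for i in range(M - 1, -1, -1):
        bit = (q >> i) & 1
        if node.children[bit] is not None:
            node = node.children[bit]
        else:
            node = node.children[1 - bit]
    return node.value

# Example usage (hypothetical data):
# root = TrieNode()
# for v in [0b1010101010101010, 0b1111000011110000]:
#     insert(root, v)
# print(bin(nearest_greedy(root, 0b1010101010101011)))
```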
This is the paper I'm referring to (Section 3.3.2): Content-Based Crowd Retrieval on the Real-Time Web
http://students.cse.tamu.edu/kykamath/papers/cikm2012/fp105-kamath.pdf
I would also appreciate it if you could answer my other question related to this one: How to lookup the most similar bit vector in an prefix tree for NN-search?
Answer 1:
I think the argument in the paper is that if it were O(f(x)), then x would have to be the number of items stored in the tree, not the number of dimensions. As you point out, for a prefix tree the time grows as O(M), where M is the length of the bit vector, but if you regard M as fixed and what you are interested in is the behaviour as the number of items in the tree increases, you get O(1).
By the way, the paper "Fast approximate nearest neighbors with automatic algorithm configuration" by Muja and Lowe also considers tree-based competitors to LSH. The idea here appears to be to randomise the tree construction, create multiple trees, and do a quick but sketchy search of each tree, picking the best answer found in any tree.
Answer 2:
This is O(1) only in a very loosely defined sense. In fact I would go so far as to challenge their usage in this case.

From their paper, to determine the nearest neighbor to a user u:

- "We first calculate its signature, u": can be O(1), depending on the "signature".
- "Then for every prefix tree in P": uh oh, not sounding very O(1); O(P) would be more correct.
- The iterative part of step 2, "... we find the nearest signature in the prefix tree, by iterating through the tree one level at a time ...": best case O(d), where d is the depth of the tree, i.e. the length of the word. (This is generous, as finding the nearest point in a prefix tree can cost more than this.)
- "After doing this ... we end up with |P| signatures ... of which the smallest hamming distance is picked": so another P calculations times the length of the word, O(Pd).

More correctly, the total runtime is O(1) + O(P) + O(Pd) + O(Pd) = O(Pd).

I believe that @mcdowella is correct in his analysis of how they try to make this O(1), but from what I've read they haven't convinced me.
Answer 3:
I assume they keep a reference to each point's node in the tree and can navigate to the next or previous entry in O(1) amortized time, i.e. the trick is to have access to the underlying nodes.
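As a small illustration of what this answer might mean (my own sketch, not something described in the paper): if the leaves of the prefix tree are threaded into a doubly linked list in sorted order, then from a handle on any stored vector its lexicographic neighbors are reachable in O(1):

```python
class Leaf:
    __slots__ = ("value", "prev", "next")
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def thread_leaves(sorted_values):
    # Build a doubly linked list of leaves from already-sorted values and
    # return a handle (dict) from each value to its leaf node.
    handles = {}
    prev = None
    for v in sorted_values:
        leaf = Leaf(v)
        leaf.prev = prev
        if prev is not None:
            prev.next = leaf
        handles[v] = leaf
        prev = leaf
    return handles

# Given a handle, the adjacent entries are available in O(1):
# h = thread_leaves([0b0001, 0b1010, 0b1111])
# node = h[0b1010]
# print(bin(node.prev.value), bin(node.next.value))   # 0b1 0b1111
```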
Source: https://stackoverflow.com/questions/17282026/finding-the-single-nearest-neighbor-using-a-prefix-tree-in-o1