Question
I'm reading this survey about LSH; in particular, I'm quoting the last paragraph of Section 2.2.1:
To improve the recall, L hash tables are constructed, and the items lying in the L (L′, L′ < L) hash buckets h_1(q), ..., h_L(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of the buckets is too large to index directly. Thus, only the nonempty buckets are retained by resorting to conventional hashing of the hash codes h_l(x).
I have 3 questions:
- The bold sentence (the last sentence of the quote) is not clear to me: what does "resorting to conventional hashing of the hash codes `h_l(x)`" mean?
- Still about the bold sentence, I'm not sure I understood the problem. I understand that `h_l(x)` can be a long code, so the number of possible buckets can be huge. For example, if `h_l(x)` is a binary code and `length` is the length of `h_l(x)`, then we have `L*2^length` possible buckets in total (since we use `L` hash tables). Is that correct?
- Last question: once we find which bucket the query vector `q` belongs to, do we have to use the original vector `q` and the original distance metric to find the nearest neighbor? For example, suppose that the original vector `q` has 128 dimensions, `q=[1,0,12,...,14.3]^T`, and that our application uses the Euclidean distance. Now suppose that the hash function used in LSH (assuming `L=1` for simplicity) maps this vector to a 20-dimensional binary space, `y=[0100...11]^T`, in order to decide which bucket to assign `q` to. So `y` is the index of bucket `B`, which already contains 100 vectors. Now, to find the nearest neighbor, we have to compare `q` with all 100 of those 128-dimensional vectors using the Euclidean distance. Is this correct?
Answer 1:
The approach they use to improve recall constructs more hash tables and essentially stores multiple copies of the ID of each reference item, hence the larger space cost [4]. If there are many empty buckets, which increases the retrieval cost, a double-hash scheme or a fast search algorithm in the Hamming space can be used to retrieve the hash buckets quickly. I think in this case they are using a double hash function to retrieve the non-empty buckets.
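To make the idea of keeping only non-empty buckets concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the survey's implementation: the class name `RandomHyperplaneLSH`, its parameters, and the choice of sign-random-projection codes are all hypothetical and picked for brevity (that hash family targets angular similarity; a Euclidean application would more likely use a p-stable family). Each table is an ordinary Python `dict` keyed by the binary code, which plays the role of the "conventional hashing" of the hash codes: a bucket exists only once some item falls into it. The query step then re-ranks the retrieved candidates with the original vectors and the Euclidean distance, as in the question's example.

```python
from collections import defaultdict

import numpy as np


class RandomHyperplaneLSH:
    """Toy LSH index (hypothetical): L tables keyed by k-bit binary codes."""

    def __init__(self, dim, num_tables=4, code_length=20, seed=0):
        rng = np.random.default_rng(seed)
        # One (code_length x dim) Gaussian projection matrix per table.
        self.planes = rng.standard_normal((num_tables, code_length, dim))
        # One dict per table: only codes that actually occur become keys,
        # so the 2^code_length possible buckets are never allocated.
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.data = None

    def _codes(self, x):
        # The sign of each projection gives one bit per hyperplane.
        bits = (self.planes @ x) > 0  # shape: (num_tables, code_length)
        # Pack the bits into bytes so the code is a hashable dict key.
        return [np.packbits(row).tobytes() for row in bits]

    def fit(self, data):
        self.data = np.asarray(data, dtype=float)
        for item_id, x in enumerate(self.data):
            for table, code in zip(self.tables, self._codes(x)):
                table[code].append(item_id)  # store the ID, not the vector
        return self

    def num_nonempty_buckets(self):
        return sum(len(table) for table in self.tables)

    def query(self, q, top_k=1):
        q = np.asarray(q, dtype=float)
        candidates = set()
        for table, code in zip(self.tables, self._codes(q)):
            # .get() avoids creating an empty bucket for an unseen code.
            candidates.update(table.get(code, ()))
        if not candidates:
            return []
        ids = np.fromiter(candidates, dtype=int)
        # Re-rank candidates with the original vectors and Euclidean distance.
        dists = np.linalg.norm(self.data[ids] - q, axis=1)
        order = np.argsort(dists)[:top_k]
        return [(int(ids[i]), float(dists[i])) for i in order]


# Usage: 100 random 128-dimensional vectors, 4 tables, 20-bit codes.
rng = np.random.default_rng(1)
data = rng.standard_normal((100, 128))
index = RandomHyperplaneLSH(dim=128, num_tables=4, code_length=20).fit(data)
print(index.num_nonempty_buckets())  # never more than n * L = 400
print(index.query(data[7]))          # candidates re-ranked by true distance
```

The dictionaries store item IDs only, so the extra memory grows with the number of items times the number of tables rather than with the size of the code space.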
Number of buckets/memory cells: O(nL), where n is the number of reference items and L the number of hash tables [1][2][3].
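As a rough worked bound (the symbol k for the code length and the numbers below are illustrative assumptions, not taken from the survey): each of the n items occupies exactly one bucket per table, so the number of non-empty buckets can exceed neither the number of stored IDs nor the size of the code space.

```latex
\#\{\text{non-empty buckets}\} \;\le\; \min\!\left(nL,\; L \cdot 2^{k}\right)

% Illustrative numbers: n = 10^{6},\ L = 10,\ k = 32
L \cdot 2^{k} = 10 \cdot 4\,294\,967\,296 \approx 4.3 \times 10^{10},
\qquad
nL = 10^{7}
```

With long codes the second term dominates by orders of magnitude, which is why only the non-empty buckets are stored.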
References:
[1] http://simsearch.yury.name/russir/03nncourse-hand.pdf
[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf
[3] https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
[4] http://research.microsoft.com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf
Source: https://stackoverflow.com/questions/37783227/non-empty-buckets-in-lsh