locality-sensitive-hash

Generating Random Hash Functions for LSH Minhash Algorithm

寵の児 submitted on 2019-11-29 02:40:25
I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment). In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it. Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it? This post was asking
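The scheme described in the question can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the class name `MinHashSketch`, its method names, and the seed are hypothetical. It also makes one assumption that differs from the question: instead of drawing the modulus c at random from 1 to 2001, it uses a single fixed prime (2^31 - 1) for every function, which is the usual form of the linear hash family h(x) = (a*x + b) mod p.

```java
import java.util.Random;

// Sketch only: MinHash with random linear hash functions h_i(x) = (a_i * x + b_i) mod P.
final class MinHashSketch {
    // Assumption: one fixed prime modulus (2^31 - 1) shared by all functions,
    // rather than a random modulus c per function as in the question.
    static final long P = 2147483647L;

    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashes, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rng.nextInt((int) (P - 1)); // a in [1, P-1]
            b[i] = rng.nextInt((int) P);           // b in [0, P-1]
        }
    }

    // Hash of a non-negative int x under the i-th function; long arithmetic avoids overflow.
    long hash(int i, int x) {
        return (a[i] * x + b[i]) % P;
    }

    // Signature: for each hash function, the minimum hash value over the set's elements.
    long[] signature(int[] set) {
        long[] sig = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            long min = Long.MAX_VALUE;
            for (int x : set) {
                min = Math.min(min, hash(i, x));
            }
            sig[i] = min;
        }
        return sig;
    }

    // Fraction of positions where two signatures agree, an estimate of Jaccard similarity.
    static double estimateSimilarity(long[] s1, long[] s2) {
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) equal++;
        }
        return (double) equal / s1.length;
    }
}
```

With 240 functions, `new MinHashSketch(240, seed).signature(ids)` yields a 240-entry signature per set; generating the coefficients once and reusing them for every set is what makes signatures comparable.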

Confusion in hashing used by LSH

不羁的心 submitted on 2019-11-28 14:21:15
Matrix M is the signature matrix, which is produced via minhashing of the actual data and has documents as columns and words as rows. So a column represents a document. Now it says that every stripe (b in number, r in length) has its columns hashed, so that a column falls in a bucket. If two columns fall in the same bucket for >= 1 stripes, then they are potentially similar. So does that mean that I should create b hash tables and find b independent hash functions? Or is just one enough, with every stripe sending its columns to the same collection of buckets (but wouldn't this cancel the stripes)?
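One common reading of the banding step is that a single hash function can serve every stripe, as long as each stripe gets its own set of buckets. The sketch below illustrates that reading under stated assumptions: the class `LshBandingSketch` and the method `candidatePairs` are hypothetical names, the signature matrix is stored as `signatures[doc][row]`, and the r values of a stripe are used directly as the map key, which stands in for an ideal hash with no accidental collisions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: LSH banding over a MinHash signature matrix.
final class LshBandingSketch {
    // Returns pairs (i, j), i < j, of documents that share a bucket in at least one band.
    // Assumption: signatures[doc] holds bands * rows values.
    static Set<List<Integer>> candidatePairs(long[][] signatures, int bands, int rows) {
        Set<List<Integer>> pairs = new HashSet<>();
        for (int band = 0; band < bands; band++) {
            // A fresh table per band, so equal values in different bands never meet.
            Map<List<Long>, List<Integer>> buckets = new HashMap<>();
            for (int doc = 0; doc < signatures.length; doc++) {
                List<Long> key = new ArrayList<>(rows);
                for (int r = 0; r < rows; r++) {
                    key.add(signatures[doc][band * rows + r]);
                }
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
            }
            // Every pair of documents inside one bucket becomes a candidate pair.
            for (List<Integer> bucket : buckets.values()) {
                for (int i = 0; i < bucket.size(); i++) {
                    for (int j = i + 1; j < bucket.size(); j++) {
                        pairs.add(Arrays.asList(bucket.get(i), bucket.get(j)));
                    }
                }
            }
        }
        return pairs;
    }
}
```

If all stripes shared one table, a column's first stripe could collide with another column's second stripe, which is the "cancelling" effect the question worries about; keeping one table per stripe avoids it without needing b different hash functions.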

Search in locality sensitive hashing

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 09:35:13
Question: I'm trying to understand Section 5 of this paper about LSH, in particular how to bucket the generated hashes. Quoting the linked paper: Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+epsilon))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permu-

Generating Random Hash Functions for LSH Minhash Algorithm

六月ゝ 毕业季﹏ submitted on 2019-11-27 16:58:29
Question: I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment). In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it. Is this an efficient

Confusion in hashing used by LSH

不问归期 submitted on 2019-11-27 08:37:23
Question: Matrix M is the signature matrix, which is produced via minhashing of the actual data and has documents as columns and words as rows. So a column represents a document. Now it says that every stripe (b in number, r in length) has its columns hashed, so that a column falls in a bucket. If two columns fall in the same bucket for >= 1 stripes, then they are potentially similar. So does that mean that I should create b hash tables and find b independent hash functions? Or is just one enough and every

How to understand Locality Sensitive Hashing?

半世苍凉 submitted on 2019-11-26 14:50:24
I noticed that LSH seems a good way to find similar items with high-dimensional properties. After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused by those formulas. Does anyone know a blog or article that explains them in an easy way? The best tutorial I have seen for LSH is in the book Mining of Massive Datasets. Check Chapter 3 - Finding Similar Items: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf. I also recommend the slides below: http://www.cs.jhu.edu/%7Evandurme/papers/VanDurmeLallACL10-slides.pdf. The example in the slide helps me a

How to understand Locality Sensitive Hashing?

你离开我真会死。 submitted on 2019-11-26 04:02:05
Question: I noticed that LSH seems a good way to find similar items with high-dimensional properties. After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused by those formulas. Does anyone know a blog or article that explains them in an easy way? Answer 1: The best tutorial I have seen for LSH is in the book Mining of Massive Datasets. Check Chapter 3 - Finding Similar Items: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf. I also recommend the slides below: