locality-sensitive-hash

Generating Random Hash Functions for LSH Minhash Algorithm

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment). In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it. Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it? This post was asking
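One commonly described alternative to drawing a random modulus c per function is to fix the modulus to a single prime p larger than every input and draw only a and b at random. With a in [1, p-1], each function h(x) = (a*x + b) mod p is then a permutation of 0..p-1, which is the property minhashing relies on. The following is a minimal sketch under that assumption, not the poster's code: the class name MinHashFamily, the prime 2003, and the seed are illustrative choices, and inputs are assumed to be non-negative integers below the prime.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of a random hash family h_i(x) = (a_i * x + b_i) mod p for minhash signatures. */
final class MinHashFamily {
    // Assumption: a fixed prime larger than every input value (inputs are below 2003).
    static final int PRIME = 2003;

    final long[] a; // multipliers, each in [1, PRIME - 1]
    final long[] b; // offsets, each in [0, PRIME - 1]

    MinHashFamily(int numHashes, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rng.nextInt(PRIME - 1);
            b[i] = rng.nextInt(PRIME);
        }
    }

    /** Value of the i-th hash function at x; x must be in [0, PRIME - 1]. */
    int hash(int i, int x) {
        return (int) ((a[i] * x + b[i]) % PRIME);
    }

    /** Minhash signature of a set: for each hash function, the minimum hash over the set. */
    int[] signature(int[] set) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int x : set) {
            for (int i = 0; i < sig.length; i++) {
                sig[i] = Math.min(sig[i], hash(i, x));
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        MinHashFamily family = new MinHashFamily(240, 42L);
        int[] sig = family.signature(new int[] {5, 17, 1999});
        System.out.println(sig.length + " signature entries, first = " + sig[0]);
    }
}
```

Keeping a and b as long avoids integer overflow in a*x + b if a larger prime is substituted later.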

Confusion in hashing used by LSH

Matrix M is the signature matrix, produced by minhashing the actual data. It has documents as columns and words as rows, so a column represents a document. Now it says that every stripe (b stripes, each r rows long) has its columns hashed, so that each column falls into a bucket. If two columns fall into the same bucket for at least one stripe, they are potentially similar. Does that mean I should create b hash tables and find b independent hash functions? Or is one enough, with every stripe sending its columns to the same collection of buckets (but wouldn't that cancel out the stripes)?
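For reference, Mining of Massive Datasets (Chapter 3) notes that the same hash function can be used for every band as long as each band gets its own bucket array, so columns that match in different bands are not mixed. The sketch below illustrates that idea under some assumptions: the class name LshBanding and the method candidatePairs are hypothetical, the signature matrix is stored one document per row for convenience, and a band's r values are used directly as the bucket key, so two documents share a bucket only when they agree on the whole band.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the banding step: one bucket table per band, same keying scheme for all bands. */
final class LshBanding {

    /**
     * signatures[d] is the minhash signature of document d, of length bands * rows.
     * Returns candidate pairs "i,j" (i < j) that share a bucket in at least one band.
     */
    static Set<String> candidatePairs(int[][] signatures, int bands, int rows) {
        Set<String> pairs = new TreeSet<>();
        for (int band = 0; band < bands; band++) {
            // A fresh table per band keeps matches in different bands apart.
            Map<List<Integer>, List<Integer>> buckets = new HashMap<>();
            for (int doc = 0; doc < signatures.length; doc++) {
                List<Integer> key = new ArrayList<>(rows);
                for (int r = 0; r < rows; r++) {
                    key.add(signatures[doc][band * rows + r]);
                }
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
            }
            // Every pair of documents in the same bucket becomes a candidate.
            for (List<Integer> docs : buckets.values()) {
                for (int i = 0; i < docs.size(); i++) {
                    for (int j = i + 1; j < docs.size(); j++) {
                        pairs.add(docs.get(i) + "," + docs.get(j));
                    }
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int[][] signatures = {
            {1, 2, 3, 4},
            {1, 2, 9, 9},
            {3, 4, 7, 8},
        };
        System.out.println(candidatePairs(signatures, 2, 2));
    }
}
```

In the example, documents 0 and 2 both contain the values 3, 4, but in different bands, so with one table per band they are not reported as a candidate pair. Sharing a single table across bands would require adding the band index to the key to get the same behavior.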

Search in locality sensitive hashing

Question: I'm trying to understand Section 5 of this paper about LSH, in particular how to bucket the generated hashes. Quoting the linked paper: Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+epsilon))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permu-
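The index construction described in the quote can be sketched for a single permutation as follows. This is only an illustration under stated assumptions: bit vectors are represented as strings of '0' and '1' characters, the class name PermutedIndex and its methods are hypothetical, and because the quoted query procedure is cut off, the locate step (a binary search for the position of the permuted query in the sorted order) is an assumption about how such an index would typically be probed rather than the paper's exact procedure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Sketch: one random bit permutation plus the bit vectors sorted under that permutation. */
final class PermutedIndex {
    final int[] perm;      // perm[i] is the source position of output bit i
    final String[] sorted; // permuted bit vectors in lexicographic order

    PermutedIndex(List<String> vectors, int d, long seed) {
        // Fisher-Yates shuffle to draw a random permutation of the d bit positions.
        perm = new int[d];
        for (int i = 0; i < d; i++) {
            perm[i] = i;
        }
        Random rng = new Random(seed);
        for (int i = d - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
        // Permute every vector, then sort; '0' < '1' makes String order lexicographic on bits.
        sorted = new String[vectors.size()];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = permute(vectors.get(i));
        }
        Arrays.sort(sorted);
    }

    /** Applies the permutation to a bit string of length d. */
    String permute(String bits) {
        StringBuilder sb = new StringBuilder(perm.length);
        for (int src : perm) {
            sb.append(bits.charAt(src));
        }
        return sb.toString();
    }

    /** Position in the sorted order where the permuted query sits or would be inserted. */
    int locate(String query) {
        int pos = Arrays.binarySearch(sorted, permute(query));
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        PermutedIndex index = new PermutedIndex(Arrays.asList("0011", "1100", "1010"), 4, 7L);
        System.out.println(Arrays.toString(index.sorted) + " query position: " + index.locate("0110"));
    }
}
```

A full index in the sense of the quote would keep N such structures, one per random permutation, and inspect the neighbors around the located position in each.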

How to understand Locality Sensitive Hashing?

Question: I noticed that LSH seems to be a good way to find similar items with high-dimensional properties. After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf , I'm still confused by the formulas. Does anyone know a blog or article that explains it in an easy way?

Answer 1: The best tutorial I have seen for LSH is in the book Mining of Massive Datasets. Check Chapter 3, Finding Similar Items: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf I also recommend the slides below: http://www.cs.jhu.edu/%7Evandurme/papers/VanDurmeLallACL10-slides.pdf . The example in the slide helps me a
