locality-sensitive-hash

How to bucket locality-sensitive hashes?

两盒软妹~` submitted on 2019-12-21 15:08:09
Question: I already have an algorithm that produces locality-sensitive hashes, but how should I bucket them to take advantage of their characteristics (i.e., similar elements have nearby hashes under the Hamming distance)? In the Matlab code I found, they simply build a distance matrix between the hashes of the query points and the hashes of the points in the database to keep the code simple, while referencing a so-called Charikar method for an actually good implementation of the search. I tried to
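A minimal sketch of one common bucketing scheme (an assumption on my part, not the asker's Matlab setup): split each 64-bit code into bands, use each band as a key into its own hash table, and verify the colliding candidates with the exact Hamming distance. The class name HammingIndex and the 64-bit code width are illustrative.

    import java.util.*;

    /** Sketch: bucket 64-bit LSH codes by bands, so items sharing any band
     *  collide. Assumes bands >= 2 and that bands divides 64 evenly. */
    class HammingIndex {
        private final int bands;
        private final List<Map<Long, List<Integer>>> tables = new ArrayList<>();

        HammingIndex(int bands) {
            this.bands = bands;
            for (int i = 0; i < bands; i++) tables.add(new HashMap<>());
        }

        private long band(long code, int i) {
            int width = 64 / bands;
            return (code >>> (i * width)) & ((1L << width) - 1);
        }

        void add(long code, int id) {
            for (int i = 0; i < bands; i++)
                tables.get(i).computeIfAbsent(band(code, i), k -> new ArrayList<>()).add(id);
        }

        /** Candidates share at least one band with the query;
         *  rank or filter them afterwards with the exact Hamming distance. */
        Set<Integer> query(long code) {
            Set<Integer> candidates = new HashSet<>();
            for (int i = 0; i < bands; i++)
                candidates.addAll(tables.get(i).getOrDefault(band(code, i), List.of()));
            return candidates;
        }

        static int hamming(long a, long b) { return Long.bitCount(a ^ b); }
    }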

Two algorithms to find nearest neighbor with Locality-sensitive hashing, which one?

无人久伴 submitted on 2019-12-21 04:13:38
Question: I'm currently studying how to find nearest neighbors using locality-sensitive hashing. While reading papers and searching the web, I found two algorithms for doing this: 1. Use L hash tables with L random LSH functions, thus increasing the chance that two similar documents get the same signature. For example, if two documents are 80% similar, then there's an 80% chance that they get the same signature from one LSH function. However, if we use
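A sketch of the first scheme under the usual MinHash assumptions: L tables, each keyed by one independent MinHash value, so a pair of documents collides in a given table with probability equal to their Jaccard similarity. All names and the choice of prime are illustrative.

    import java.util.*;

    /** Sketch: L hash tables, each keyed by a single independent MinHash value. */
    class MultiTableMinHash {
        private static final long PRIME = 2147483647L;   // 2^31 - 1
        private final int L;
        private final long[] seedA, seedB;               // random affine hash parameters
        private final List<Map<Long, List<Integer>>> tables = new ArrayList<>();

        MultiTableMinHash(int L, long seed) {
            this.L = L;
            Random rnd = new Random(seed);
            seedA = new long[L];
            seedB = new long[L];
            for (int i = 0; i < L; i++) {
                seedA[i] = 1 + rnd.nextInt((int) (PRIME - 1));
                seedB[i] = rnd.nextInt((int) PRIME);
                tables.add(new HashMap<>());
            }
        }

        private long minHash(Set<Integer> tokens, int i) {
            long min = Long.MAX_VALUE;
            for (int t : tokens) {
                long x = Math.floorMod((long) t, PRIME);
                min = Math.min(min, (seedA[i] * x + seedB[i]) % PRIME);
            }
            return min;
        }

        void add(Set<Integer> tokens, int docId) {
            for (int i = 0; i < L; i++)
                tables.get(i).computeIfAbsent(minHash(tokens, i), k -> new ArrayList<>()).add(docId);
        }

        /** Documents colliding with the query in at least one of the L tables. */
        Set<Integer> query(Set<Integer> tokens) {
            Set<Integer> candidates = new HashSet<>();
            for (int i = 0; i < L; i++)
                candidates.addAll(tables.get(i).getOrDefault(minHash(tokens, i), List.of()));
            return candidates;
        }
    }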

Is LSH about transforming vectors to binary vectors for Hamming distance?

ぐ巨炮叔叔 submitted on 2019-12-20 03:16:41
Question: I've read some papers about LSH and I know it is used to solve the approximate k-NN problem. The algorithm can be divided into two parts: (1) given a vector in D dimensions (where D is big), translate it with a set of N hash functions (where N << D) into a binary vector in N dimensions; (2) using the Hamming distance, apply some search technique to the set of binary codes obtained in phase 1 to find the k-NN. The key point is that computing the Hamming distance for vectors in N dimensions
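Phase 1 is commonly done with random-hyperplane LSH (Charikar's SimHash): each output bit is the sign of a dot product with one random Gaussian vector, so the Hamming distance between codes tracks the angle between the original vectors. A minimal sketch, with illustrative names:

    import java.util.Random;

    /** Sketch of phase 1: N random hyperplanes in D dimensions give an N-bit code. */
    class RandomHyperplaneLSH {
        private final double[][] planes;

        RandomHyperplaneLSH(int n, int d, long seed) {
            Random rnd = new Random(seed);
            planes = new double[n][d];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < d; j++)
                    planes[i][j] = rnd.nextGaussian();
        }

        boolean[] hash(double[] v) {
            boolean[] code = new boolean[planes.length];
            for (int i = 0; i < planes.length; i++) {
                double dot = 0;
                for (int j = 0; j < v.length; j++) dot += planes[i][j] * v[j];
                code[i] = dot >= 0;          // one bit per hyperplane
            }
            return code;
        }

        static int hamming(boolean[] a, boolean[] b) {
            int dist = 0;
            for (int i = 0; i < a.length; i++) if (a[i] != b[i]) dist++;
            return dist;
        }
    }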

Approximate String Matching using LSH

穿精又带淫゛_ submitted on 2019-12-18 14:53:24
Question: I would like to approximately match strings using locality-sensitive hashing. I have many strings (>10M) that may contain typos. For every string I would like to compare it against all the other strings and select those within some edit-distance threshold. The naive solution requires O(n^2) comparisons. To avoid that, I was thinking of using locality-sensitive hashing: then nearly similar strings would land in the same buckets, and I would only need to do inside
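One caveat worth noting: edit distance itself has no known LSH family, so the usual workaround is to shingle each string into character n-grams and run MinHash/Jaccard over the resulting sets, since strings within a small edit distance share most of their n-grams. A minimal sketch of the shingling step (names illustrative):

    import java.util.HashSet;
    import java.util.Set;

    /** Sketch: turn a string into a set of character n-gram token ids,
     *  suitable as input to a MinHash index such as the one sketched above. */
    class Shingler {
        static Set<Integer> shingles(String s, int n) {
            Set<Integer> grams = new HashSet<>();
            for (int i = 0; i + n <= s.length(); i++)
                grams.add(s.substring(i, i + n).hashCode());  // token id for the n-gram
            return grams;
        }
        // Only strings colliding in some bucket are then verified with an
        // exact edit-distance computation, replacing the O(n^2) all-pairs scan.
    }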

Number of buckets in LSH

社会主义新天地 submitted on 2019-12-14 01:26:50
Question: In LSH, you hash slices of the documents into buckets. The idea is that documents that fall into the same bucket are potentially similar, and thus possible nearest neighbors. For 40,000 documents, what is a good value (roughly) for the number of buckets? I currently have number_of_buckets = 40000 / 4, but I feel it can be reduced further. Any ideas, please? Related: How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)? Answer 1: A common starting
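As a back-of-envelope check (not a recommendation), with roughly uniform hashing the expected occupancy is simply the ratio of items to buckets, so the asker's current choice gives about 4 candidates per bucket:

    /** Back-of-envelope: expected bucket occupancy under uniform hashing. */
    class BucketSizing {
        public static void main(String[] args) {
            int numDocs = 40_000;
            int numBuckets = numDocs / 4;                       // the asker's current choice
            System.out.println((double) numDocs / numBuckets);  // prints 4.0
        }
    }

Fewer buckets means more candidates per bucket to verify; the right balance depends on how expensive the verification step is.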

Similar images: Bag of Features / Visual Word or matching descriptors?

不羁的心 submitted on 2019-12-13 05:26:16
Question: I have an application where, given a reasonable number of images (say 20K) and a query image, I want to find the most similar one. A reasonable approximation is acceptable. To guarantee precision in representing each image, I'm using SIFT (a parallel version, for fast computation as well). Now, given the set of n SIFT descriptors (where usually 500 < n < 1000, depending on image size), which can be represented as an n x 128 matrix, from what I've seen in the literature there are two
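For the bag-of-features route, once a visual vocabulary has been learned (typically by k-means over a sample of SIFT descriptors), each image's n x 128 matrix collapses into a single k-dimensional histogram. A minimal sketch, assuming the vocabulary is already computed:

    /** Sketch: assign each descriptor to its nearest visual word and
     *  accumulate an L1-normalized histogram over the vocabulary. */
    class BagOfFeatures {
        static double[] histogram(double[][] descriptors, double[][] vocabulary) {
            double[] hist = new double[vocabulary.length];
            for (double[] d : descriptors) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < vocabulary.length; c++) {
                    double dist = 0;
                    for (int j = 0; j < d.length; j++) {
                        double diff = d[j] - vocabulary[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                hist[best] += 1.0;                    // vote for the nearest visual word
            }
            for (int i = 0; i < hist.length; i++) hist[i] /= descriptors.length;
            return hist;
        }
    }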

Non-empty buckets in LSH

筅森魡賤 submitted on 2019-12-12 03:09:07
Question: I'm reading this survey about LSH, in particular the last paragraph of section 2.2.1: "To improve the recall, L hash tables are constructed, and the items lying in the L (L', L' < L) hash buckets h_1(q), ..., h_L(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of the buckets is too
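The tradeoff the survey describes can be made concrete (this is the standard analysis, not a quote from the survey): if each of the L tables uses a code built from K independent hash functions, and a single function lets a near pair collide with probability p, then the probability that the pair is retrieved from at least one table is

    P(retrieved) = 1 - (1 - p^K)^L

Longer codes (larger K) shrink the buckets and raise precision, while more tables (larger L) recover the recall lost in the process, at the cost of memory.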

Global vector descriptor

不想你离开。 submitted on 2019-12-12 02:18:37
Question: Usually, algorithms such as SIFT, SURF, and many others provide a set of k keypoints and the associated descriptors in d dimensions (for example, in SIFT each descriptor has d = 128 dimensions). So, in order to describe an image, we need a k x d matrix (k descriptor vectors, each in d dimensions). So far so good. My question is: how can we describe an image with a single vector? This could be really useful, since we could save a lot of space, and because certain algorithms (like LSH) require a
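The crudest way to get one vector is to pool the k rows, e.g. by averaging; bag-of-features histograms (see the sketch above) and VLAD/Fisher vectors are the usual, more discriminative answers. A minimal sketch of mean pooling, for illustration only:

    /** Sketch: collapse a k x d descriptor matrix into one d-dimensional
     *  vector by averaging the rows. Simple, but discards spatial structure. */
    class GlobalDescriptor {
        static double[] meanPool(double[][] descriptors) {
            int d = descriptors[0].length;
            double[] global = new double[d];
            for (double[] row : descriptors)
                for (int j = 0; j < d; j++) global[j] += row[j];
            for (int j = 0; j < d; j++) global[j] /= descriptors.length;
            return global;
        }
    }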

What is the ε (epsilon) parameter in Locality Sensitive Hashing (LSH)?

好久不见. submitted on 2019-12-11 01:37:10
Question: I've read the original paper on locality-sensitive hashing. The complexity is a function of the parameter ε, but I don't understand what it is. Can you explain its meaning, please? Answer 1: ε is the approximation parameter. LSH (like FLANN and kd-GeRaF) is designed for high-dimensional data. In that space, exact k-NN doesn't work well; in fact, it is almost as slow as brute force, because of the curse of dimensionality. For that reason, we focus on solving approximate k-NN. Check Definition 1 from
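For reference, the standard definition behind that answer: a (1 + ε)-approximate nearest neighbor of a query q is any point p satisfying

    dist(q, p) <= (1 + ε) * dist(q, p*)

where p* is the exact nearest neighbor. With ε = 0 this is exact search; larger ε trades accuracy for speed.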

LSH Spark gets stuck forever at approxSimilarityJoin() function

老子叫甜甜 submitted on 2019-12-09 02:16:26
I am trying to use Spark LSH to find nearest neighbours for each user on a very large dataset containing 50,000 rows and ~5,000 features per row. Here is the relevant code:

    import org.apache.spark.ml.feature.MinHashLSH;
    import org.apache.spark.ml.feature.MinHashLSHModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Fit a MinHash model over the sparse feature vectors.
    MinHashLSH mh = new MinHashLSH()
            .setNumHashTables(3)
            .setInputCol("features")
            .setOutputCol("hashes");
    MinHashLSHModel model = mh.fit(dataset);

    // Self-join: return pairs of rows whose Jaccard distance is below the threshold.
    Dataset<Row> approxSimilarityJoin = model.approxSimilarityJoin(
            dataset, dataset, config.getJaccardLimit(), "JaccardDistance");
    approxSimilarityJoin.show();

The job gets stuck at the approxSimilarityJoin() function and never goes beyond it. Please let me know how to solve