minhash | 易学教程

Locality-sensitive hashing - Elasticsearch

Question: Is there any plugin that enables LSH on Elasticsearch? If so, could you point me to it and briefly explain how to use it? Thanks.

Edit: I found out that ES has a MinHash plugin. How can I compare documents to one another with it? What would be a good setting for finding duplicates?

Answer 1: There is an Elasticsearch MinHash Plugin. You can use it to compute a MinHash value each time you index a document, and then query documents by that MinHash value later.

Install the MinHash plugin:

$ $ES_HOME/bin/plugin

Can you suggest a good minhash implementation?

I am looking for an open source MinHash implementation that I can use in my work. The functionality I need is very simple: given a set as input, the implementation should return its MinHash. A Python or C implementation would be preferred, in case I need to hack it to work for me. Any pointers would be of great help. Regards.

You should have a look at the following open source libraries, in order. All of them are in Python and show how to calculate document similarity using LSH/MinHash:

lsh
LSHHDC: Locality-Sensitive Hashing based High Dimensional Clustering
MinHash
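To make the "given a set, return its MinHash" requirement concrete, here is a minimal standard-library sketch. It is not taken from any of the libraries listed above. The names `minhash`, `estimate_jaccard`, and `num_perm` are chosen here for illustration, and salted BLAKE2b is assumed as a stand-in for a family of independent hash functions.

```python
import hashlib


def minhash(items, num_perm=128):
    """Return a MinHash signature (list of ints) for a non-empty set of strings.

    Assumption: each salted BLAKE2b instance acts as one "random" hash function.
    """
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = {"the", "quick", "brown", "fox"}
    doc_b = {"the", "quick", "brown", "dog"}
    print(estimate_jaccard(minhash(doc_a), minhash(doc_b)))
```

The estimate gets tighter as `num_perm` grows, at the cost of more hashing per set.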

String similarity with OR condition in MinHash Spark ML

I have two datasets. The first is a large reference dataset, and for records in the second dataset I want to find the best match in the first dataset using the MinHash algorithm.

val dataset1 =
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|
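As a rough illustration of the "best match" idea, the sketch below compares `dataString` values by exact Jaccard similarity over character bigrams, which is the quantity MinHash approximates. It is plain Python, not Spark ML's `MinHashLSH`. The helper names and the misspelled query string are hypothetical. The OR condition from the title is not shown because the excerpt is cut off before it is described.

```python
def shingles(text, k=2):
    """Return the set of lowercase character k-grams of a string."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(query, references):
    """Return the reference string whose bigram set is most similar to the query."""
    query_shingles = shingles(query)
    return max(references, key=lambda ref: jaccard(query_shingles, shingles(ref)))


# dataString values copied from the complete rows of dataset1 above.
reference_strings = [
    "John|Smith|55649",
    "Emma|Morales|78439",
    "Janet|Alvarado|89488",
    "Elizabeth|K|36935",
    "Cristin|Cruz|75716",
]

if __name__ == "__main__":
    # Hypothetical query with a misspelled first name.
    print(best_match("Jon|Smith|55649", reference_strings))
```

On a large reference dataset this pairwise scan becomes expensive, which is the motivation for replacing the exact comparison with MinHash signatures and LSH.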

Choosing between SimHash and MinHash for a production system

I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be better to use.

I am creating a backend system for a website to find near duplicates of semi-structured text data. For example, each record will have a title, location, and a brief text description (<500 words).

Specific language implementation aside, which algorithm would be best for a greenfield production system?

Simhash is faster (very fast) and
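For comparison with MinHash, here is a minimal sketch of a 64-bit SimHash fingerprint over unweighted tokens, compared by Hamming distance. The function names and the use of BLAKE2b as the token hash are assumptions made for illustration. A production system would likely weight tokens (for example by term frequency) and choose its tokenization carefully.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8-byte digest used below


def simhash(tokens):
    """Return a 64-bit SimHash fingerprint for an iterable of string tokens."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    record_a = "senior backend engineer berlin build search systems".split()
    record_b = "senior backend engineer berlin build ranking systems".split()
    print(hamming_distance(simhash(record_a), simhash(record_b)))
```

Near duplicates tend to produce fingerprints that differ in only a few bits, so a small Hamming distance threshold can serve as the duplicate test.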

Generating Random Hash Functions for LSH Minhash Algorithm

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case) and run any number of integers through them (2000 at the moment).

To do that, I've been generating random numbers a, b, and c (in the range 1 to 2001) for each of the 240 hash functions. My hash function then returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.

Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?

This post was asking
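One widely used variant of the scheme described above keeps the form h = (a*x + b) % c but fixes c to a prime larger than the largest input, drawing only a and b at random. The sketch below shows that variant in Python rather than Java. The prime 2^31 - 1 and the function names are choices made here for illustration, not something stated in the question.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a Mersenne prime; assumed larger than every input x


def make_hash_functions(count, seed=None):
    """Return `count` functions of the form h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    functions = []
    for _ in range(count):
        a = rng.randrange(1, PRIME)  # a must be non-zero
        b = rng.randrange(0, PRIME)
        # Default arguments bind this iteration's a and b to the lambda.
        functions.append(lambda x, a=a, b=b: (a * x + b) % PRIME)
    return functions


def minhash_signature(values, hash_functions):
    """Return the minimum hash value of the set under each hash function."""
    return [min(h(x) for x in values) for h in hash_functions]


if __name__ == "__main__":
    hash_functions = make_hash_functions(240, seed=42)
    print(minhash_signature({3, 17, 256, 1999}, hash_functions)[:5])
```

Because the modulus is prime and a is non-zero, each function maps distinct inputs below the modulus to distinct outputs. A small random modulus c can collapse many inputs onto the same value.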
