I am trying to look for a minhash open source implementation which I can leverage for my work.
The functionality I need is very simple: given a set as input, the implementation should return its minhash signature.
In case you are interested in studying the minhash algorithm, here is a very simple implementation with some discussion.
To generate a MinHash signature for a set, we create a vector of length $N$ in which all values are set to positive infinity. We also create $N$ functions that each take an input integer and permute that value; the $i^{th}$ function is solely responsible for updating the $i^{th}$ value in the vector. Given this vector and these functions, we can compute the minhash signature of a set by passing each member of the set through each of the $N$ permutation functions. If the output of the $i^{th}$ permutation function is lower than the $i^{th}$ value of the vector, we replace that value with the function's output (this is why the hash is known as the "min-hash"). Let's implement this in Python:
from random import randint

# specify the length of each minhash vector
N = 128
max_val = (2**32) - 1

# create N tuples of random coefficients that will serve as permutation functions;
# the same permutation values are used to hash all input sets, and each input
# set will be represented by its own signature vector of length N
perms = [(randint(0, max_val), randint(0, max_val)) for i in range(N)]
def minhash(s, prime=4294967311):
    '''
    Given a set `s`, pass each member of the set through all permutation
    functions, and set the `ith` position of `vec` to the `ith` permutation
    function's output if that output is smaller than `vec[i]`.
    '''
    # initialize a minhash of length N with positive infinity values
    vec = [float('inf') for i in range(N)]

    for val in s:
        # ensure s is composed of integers
        if not isinstance(val, int):
            val = hash(val)
        # loop over each "permutation function"
        for perm_idx, perm_vals in enumerate(perms):
            a, b = perm_vals
            # pass `val` through the `ith` permutation function
            # (`prime` is a prime larger than max_val)
            output = (a * val + b) % prime
            # conditionally update the `ith` value of vec
            if vec[perm_idx] > output:
                vec[perm_idx] = output

    # the returned vector represents the minimum hash of the set s
    return vec
That's all there is to it! To demonstrate how we might use this implementation, let's walk through a simple example:
import numpy as np
from scipy.spatial.distance import cosine

# specify some input sets
data1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
             'estimating', 'the', 'similarity', 'between', 'datasets'])
data2 = set(['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
             'estimating', 'the', 'similarity', 'between', 'documents'])

# get the minhash vectors for each input set
vec1 = minhash(data1)
vec2 = minhash(data2)

# divide both vectors by their max values to scale values to [0, 1]
vec1 = np.array(vec1) / max(vec1)
vec2 = np.array(vec2) / max(vec2)

# measure the similarity between the vectors using cosine similarity
print(' * similarity:', 1 - cosine(vec1, vec2))
This returns ~0.9 as a measure of the similarity between these vectors.
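As an aside, minhash signatures are more often compared directly, without the scaling and cosine steps above: the fraction of positions at which two signatures agree is an unbiased estimate of the Jaccard similarity of the underlying sets. Here is a minimal sketch using the minhash function above (the helper name `approx_jaccard` is my own):

def approx_jaccard(v1, v2):
    '''Estimate Jaccard similarity as the fraction of signature
    positions on which the two minhash vectors agree.'''
    matches = sum(1 for a, b in zip(v1, v2) if a == b)
    return matches / len(v1)

print(' * estimated Jaccard:', approx_jaccard(minhash(data1), minhash(data2)))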
While we compare just two minhash vectors above, a "Locality Sensitive Hash" lets us find similar sets among many without comparing every pair. To do so, we can build a dictionary that maps each sequence of $W$ MinHash vector components to a unique identifier for the set from which the MinHash vector was constructed. For example, if $W = 4$ and we have a set S1 from which we derive a MinHash vector [111,512,736,927,817...], we would add the S1 identifier to each sequence of four MinHash values in that vector:
d[111-512-736-927].append('S1')
d[512-736-927-817].append('S1')
...
Once we do this for all sets, we can examine each key in the dictionary; if a key maps to multiple distinct set IDs, we have reason to believe those sets are similar. Indeed, the greater the number of dictionary entries in which a pair of set IDs co-occurs, the greater the similarity between the two sets. Processing our data in this way reduces the complexity of identifying all pairs of similar sets to roughly linear time!
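Here is a minimal sketch of that bucketing scheme, reusing the minhash function above; the window size `W` and the `lsh_index` helper are my own illustrative names:

from collections import defaultdict

W = 4  # number of consecutive minhash values per dictionary key

def lsh_index(named_sets):
    '''Map each window of W consecutive minhash values to the IDs of
    the sets whose signatures contain that window.'''
    d = defaultdict(list)
    for set_id, s in named_sets.items():
        vec = minhash(s)
        for i in range(len(vec) - W + 1):
            key = '-'.join(str(v) for v in vec[i:i + W])
            d[key].append(set_id)
    return d

d = lsh_index({'S1': data1, 'S2': data2})
# buckets containing multiple distinct set IDs suggest similar sets
candidates = [ids for ids in d.values() if len(set(ids)) > 1]
print(' * candidate pairs found in', len(candidates), 'buckets')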
Take a look at the datasketch library. It supports serialization and merging, and it is implemented in pure Python with no external dependencies. The Go version offers the same functionality.
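For instance, a minimal sketch of datasketch's MinHash API as I understand it (check the library's docs for the current interface):

from datasketch import MinHash

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for token in ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure']:
    m1.update(token.encode('utf8'))
for token in ['minhash', 'is', 'a', 'probability', 'data', 'structure']:
    m2.update(token.encode('utf8'))

# estimate the Jaccard similarity between the two sets
print(m1.jaccard(m2))

datasketch also ships a MinHashLSH class that handles the bucketing step described in the previous answer.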
I would suggest this library as well, especially if you need persistence. With it, you can use redis to store/retrieve all your data.
You have the option to select a redis database, or to simply use built-in in-memory Python dictionaries.
Performance with redis, at least if the redis server is running on your local machine, is almost identical to that achieved via standard Python dictionaries.
You only need to specify a config dictionary such as
config = {"redis": {"host": 'localhost', "port": 6379, "db": 0}}
and pass it as an argument to the LSHash class constructor.
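For completeness, a rough sketch of what that might look like (the constructor arguments and the `storage_config` keyword are my assumptions from memory; verify against the library's README):

from lshash import LSHash

# assumed signature: LSHash(hash_size, input_dim, storage_config=...)
config = {"redis": {"host": 'localhost', "port": 6379, "db": 0}}
lsh = LSHash(6, 8, storage_config=config)  # 6-bit hashes over 8-dim input vectors

lsh.index([1, 2, 3, 4, 5, 6, 7, 8])
print(lsh.query([1, 2, 3, 4, 5, 6, 7, 7]))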
You should have a look at the following open source libraries, in that order. All of them are in Python and show how to calculate document similarity using LSH/MinHash:
lsh
LSHHDC : Locality-Sensitive Hashing based High Dimensional Clustering
MinHash