Can you suggest a good minhash implementation?

太阳男子 2021-01-31 10:50

I am looking for an open-source MinHash implementation that I can use in my work.

The functionality I need is very simple, given a set as input, the impleme

4 Answers
  •  滥情空心
    2021-01-31 11:22

    In case you are interested in studying the minhash algorithm, here is a very simple implementation with some discussion.

    To generate a MinHash signature for a set, we create a vector of length $N$ in which all values are set to positive infinity. We also create $N$ functions that each take an input integer and permute that value. The $i^{th}$ function is solely responsible for updating the $i^{th}$ value in the vector. We then compute the MinHash signature of a set by passing each value from the set through each of the $N$ permutation functions. If the output of the $i^{th}$ permutation function is lower than the $i^{th}$ value of the vector, we replace that value with the output of the permutation function (this is why the hash is known as the "min-hash"). Let's implement this in Python:

    from scipy.spatial.distance import cosine
    from random import randint
    import numpy as np
    
    # specify the length of each minhash vector
    N = 128
    max_val = (2**32)-1
    
    # create N (a, b) coefficient pairs that will serve as permutation functions
    # the same pairs must be used to hash every input set
    # `a` starts at 1 because a == 0 would map every input value to `b`
    perms = [ (randint(1,max_val), randint(0,max_val)) for i in range(N)]
    
    def minhash(s, prime=4294967311):
      '''
      Given a set `s`, pass each member of the set through all permutation
      functions, and set the `ith` position of `vec` to the `ith` permutation
      function's output if that output is smaller than `vec[i]`.
      '''
      # initialize a minhash of length N with positive infinity values
      vec = [float('inf') for i in range(N)]
    
      for val in s:
    
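        # map non-integer members to integers; note that Python salts str
        # hashes per process, so signatures are only comparable within one run
        if not isinstance(val, int): val = hash(val)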
    
        # loop over each "permutation function"
        for perm_idx, perm_vals in enumerate(perms):
          a, b = perm_vals
    
          # pass `val` through the `ith` permutation function
          output = (a * val + b) % prime
    
          # conditionally update the `ith` value of vec
          if vec[perm_idx] > output:
            vec[perm_idx] = output
    
      # the returned vector represents the minimum hash of the set s
      return vec
    

    That's all there is to it! To demonstrate how we might use this implementation, let's take a simple example:

    import numpy as np
    
    # specify some input sets
    data1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'datasets'])
    data2 = set(['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'documents'])
    
    # get the minhash vectors for each input set
    vec1 = minhash(data1)
    vec2 = minhash(data2)
    
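    # divide both vectors by their max values to scale them into the range [0, 1]
    vec1 = np.array(vec1) / max(vec1)
    vec2 = np.array(vec2) / max(vec2)
    
    # measure the similarity between the vectors using cosine similarity
    # (note: the standard MinHash estimate of Jaccard similarity is instead
    # the fraction of positions at which the two vectors hold the same value)
    print( ' * similarity:', 1 - cosine(vec1, vec2) )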
    

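    This returns roughly 0.9 as a measure of the similarity between these vectors. The exact value varies from run to run because the permutation coefficients are drawn at random.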
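
    For comparison, the usual MinHash estimate of Jaccard similarity is the fraction of positions at which two signatures agree. The following is a minimal, self-contained sketch of that estimator, not a replacement for the code above. The names `stable_hash`, `signature`, `estimate_jaccard` and `exact_jaccard`, the fixed random seed, and the use of `zlib.crc32` in place of the built-in `hash` are all assumptions made for illustration. It also assumes the input sets are non-empty and contain only strings.

```python
import random
import zlib

N = 128               # signature length, as in the code above
PRIME = 4294967311    # same modulus as the `prime` default above
MAX_VAL = 2**32 - 1

# fixed seed (an assumption) so that repeated runs give the same signatures
rng = random.Random(0)
PERMS = [(rng.randint(1, MAX_VAL), rng.randint(0, MAX_VAL)) for _ in range(N)]

def stable_hash(token):
    """Map a string to a 32-bit integer that is identical across runs."""
    return zlib.crc32(token.encode("utf-8"))

def signature(s):
    """Return the MinHash signature of a non-empty set of strings."""
    hashed = [stable_hash(t) for t in s]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in PERMS]

def estimate_jaccard(sig1, sig2):
    """Fraction of positions at which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

def exact_jaccard(s1, s2):
    """Exact Jaccard similarity: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)

data1 = {'minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
         'estimating', 'the', 'similarity', 'between', 'datasets'}
data2 = {'minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
         'estimating', 'the', 'similarity', 'between', 'documents'}

print(' * exact Jaccard:    ', exact_jaccard(data1, data2))  # 10/14, about 0.714
print(' * estimated Jaccard:', estimate_jaccard(signature(data1), signature(data2)))
```

    The two sets share 10 of 14 distinct words, so the exact Jaccard similarity is 10/14. The estimate should land near that value, and a larger $N$ narrows the expected error.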

    Above we compare just two MinHash vectors, but we can find similar pairs among many vectors much more efficiently by using a "Locality Sensitive Hash". To do so, we build a dictionary that maps each sequence of $W$ consecutive MinHash vector components to the identifiers of the sets whose MinHash vectors contain that sequence. For example, if $W = 4$ and we have a set S1 from which we derive a MinHash vector [111,512,736,927,817...], we would add the S1 identifier under each sequence of four consecutive MinHash values in that vector:

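    # d is a collections.defaultdict(list); keys are strings, since an
    # unquoted 111-512-736-927 would be evaluated as a subtraction
    d['111-512-736-927'].append('S1')
    d['512-736-927-817'].append('S1')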
    ...
    

    Once we do this for all sets, we can examine each key in the dictionary, and if a key holds multiple distinct set IDs, we have reason to believe those sets are similar. Indeed, the more keys under which a pair of set IDs appears together, the greater the similarity between the two sets. Processing our data in this way avoids comparing every pair of sets directly, so as long as few sets share any one key, the cost of identifying candidate pairs of similar sets drops from quadratic to roughly linear in the number of sets!
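
    The dictionary described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `build_index` and `count_shared_keys` are hypothetical, keys are the window values joined with hyphens, and the toy signatures are made-up values that are far shorter than a real vector of length $N$.

```python
from collections import Counter, defaultdict
from itertools import combinations

W = 4  # number of consecutive signature values per key

def build_index(signatures, w=W):
    """Map each run of `w` consecutive values to the ids of the sets containing it."""
    d = defaultdict(list)
    for set_id, sig in signatures.items():
        for i in range(len(sig) - w + 1):
            key = '-'.join(str(v) for v in sig[i:i + w])
            if set_id not in d[key]:
                d[key].append(set_id)
    return d

def count_shared_keys(d):
    """Count, for every pair of set ids, the number of keys they share."""
    counts = Counter()
    for ids in d.values():
        for pair in combinations(sorted(ids), 2):
            counts[pair] += 1
    return counts

# toy signatures (hypothetical values, for illustration only)
signatures = {
    'S1': [111, 512, 736, 927, 817, 304],
    'S2': [111, 512, 736, 927, 817, 999],
    'S3': [205, 640, 118, 450, 333, 721],
}

d = build_index(signatures)
shared = count_shared_keys(d)
print(shared)  # Counter({('S1', 'S2'): 2})
```

    Here S1 and S2 share two of their three windows, so they surface as a candidate pair, while S3 shares no key with either and is never compared against them.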
