Calculating the similarity of two lists

前端 未结 6 1290
野趣味
野趣味 2020-12-24 03:55

I have two lists:

eg. a = [1,8,3,9,4,9,3,8,1,2,3] and b = [1,8,1,3,9,4,9,3,8,1,2,3]

Both contain ints. There is no meaning behind the ints (eg. 1 is not \'cl

相关标签:
6条回答
  • 2020-12-24 04:03

    It sounds like edit (or Levenshtein) distance is precisely the right tool for the job.

    Here is one Python implementation that can be used on lists of integers: http://hetland.org/coding/python/levenshtein.py

    Using that code, levenshtein([1,8,3,9,4,9,3,8,1,2,3], [1,8,1,3,9,4,9,3,8,1,2,3]) returns 1, which is the edit distance.

    Given the edit distance and the lengths of the two arrays, computing a "percentage similarity" metric should be pretty trivial.

    0 讨论(0)
  • 2020-12-24 04:06

    One way to tackle this is to utilize histogram. As an example (demonstration with numpy):

    In []: a= array([1,8,3,9,4,9,3,8,1,2,3])
    In []: b= array([1,8,1,3,9,4,9,3,8,1,2,3])
    
    In []: a_c, _= histogram(a, arange(9)+ 1)
    In []: a_c
    Out[]: array([2, 1, 3, 1, 0, 0, 0, 4])
    
    In []: b_c, _= histogram(b, arange(9)+ 1)
    In []: b_c
    Out[]: array([3, 1, 3, 1, 0, 0, 0, 4])
    
    In []: (a_c- b_c).sum()
    Out[]: -1
    

    There exist now plethora of ways to harness a_c and b_c.

    Where the (seemingly) simplest similarity measure is:

    In []: 1- abs(-1/ 9.)
    Out[]: 0.8888888888888888
    

    Followed by:

    In []: norm(a_c)/ norm(b_c)
    Out[]: 0.92796072713833688
    

    and:

    In []: a_n= (a_c/ norm(a_c))[:, None]
    In []: 1- norm(b_c- dot(dot(a_n, a_n.T), b_c))/ norm(b_c)
    Out[]: 0.84445724579043624
    

    Thus, you need to be much more specific to find out most suitable similarity measure suitable for your purposes.

    0 讨论(0)
  • 2020-12-24 04:06

    Unless im missing the point.

    from __future__ import division
    
    def similar(x,y):
        si = 0
        for a,b in zip(x, y):
            if a == b:
                si += 1
        return (si/len(x)) * 100
    
    
    if __name__ in '__main__':
        a = [1,8,3,9,4,9,3,8,1,2,3] 
        b = [1,8,1,3,9,4,9,3,8,1,2,3]
        result = similar(a,b)
        if result is not None:
            print "%s%s Similar!" % (result,'%')
    
    0 讨论(0)
  • 2020-12-24 04:09

    Just use the same algorithm for calculating edit distance on strings if the values don't have any particular meaning.

    0 讨论(0)
  • 2020-12-24 04:09

    I've implemented something for a similar task a long time ago. Now, I have only a blog entry for that. It was simple: you had to compute the pdf of both sequences then it would find the common area covered by the graphical representation of pdf.

    Sorry for the broken images on link, the external server that I've used back then is dead now.

    Right now, for your problem the code translates to

    def overlap(pdf1, pdf2):
        s = 0
        for k in pdf1:
            if pdf2.has_key(k):
                s += min(pdf1[k], pdf2[k])
        return s
    
    def pdf(l):
        d = {}
        s = 0.0
        for i in l:
            s += i
            if d.has_key(i):
                d[i] += 1
            else:
                d[i] = 1
        for k in d:
            d[k] /= s
        return d
    
    def solve():
        a = [1, 8, 3, 9, 4, 9, 3, 8, 1, 2, 3]
        b = [1, 8, 1, 3, 9, 4, 9, 3, 8, 1, 2, 3]
        pdf_a = pdf(a)
        pdf_b = pdf(b)
        print pdf_a
        print pdf_b
        print overlap(pdf_a, pdf_b)
        print overlap(pdf_b, pdf_a)
    
    if __name__ == '__main__':
        solve()
    

    Unfortunately, it gives an unexpected answer, only 0.212292609351

    0 讨论(0)
  • 2020-12-24 04:23

    You can use the difflib module

    ratio()
    Return a measure of the sequences’ similarity as a float in the range [0, 1].

    Which gives :

     >>> s1=[1,8,3,9,4,9,3,8,1,2,3]
     >>> s2=[1,8,1,3,9,4,9,3,8,1,2,3]
     >>> sm=difflib.SequenceMatcher(None,s1,s2)
     >>> sm.ratio()
     0.9565217391304348
    
    0 讨论(0)
提交回复
热议问题