Calculate similarity between lists of words

暗喜 2021-01-03 15:01

I want to calculate the similarity between two lists of words, for example:

['email','user','this','email','address','customer']

is

2 answers
  • 2021-01-03 15:34

    Since you haven't shown a clear expected output, here is my best shot:

    list_A = ['email','user','this','email','address','customer']
    list_B = ['email','mail','address','netmail']
    

    Given the two lists above, we compute the cosine similarity between each word in list_B and every word in list_A, e.g. email from list_B against every element of list_A. Each word is represented as a vector of its character counts:

    from collections import Counter
    from math import sqrt
    
    def word2vec(word):
    
        # count the characters in word
        cw = Counter(word)
        # precomputes a set of the different characters
        sw = set(cw)
        # precomputes the "length" of the word vector
        lw = sqrt(sum(c*c for c in cw.values()))
    
        # return a tuple
        return cw, sw, lw
    
    def cosdis(v1, v2):
        # which characters are common to the two words?
        common = v1[1].intersection(v2[1])
        # cosine similarity: dot product divided by the product of the vector lengths
        return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]
    
    
    list_A = ['email','user','this','email','address','customer']
    list_B = ['email','mail','address','netmail']
    
    threshold = 0.80     # if needed
    for key in list_A:
        for word in list_B:
            try:
                # print(key)
                # print(word)
                res = cosdis(word2vec(word), word2vec(key))
                # print(res)
                print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
                # if res > threshold:
                #     print("Found a word with cosine similarity > 80% : {} with original word: {}".format(word, key))
            except ZeroDivisionError:
                # an empty string has a zero-length vector; skip it
                pass
    

    OUTPUT:

    The cosine similarity between : email and : email is: 100.0
    The cosine similarity between : mail and : email is: 89.44271909999159
    The cosine similarity between : address and : email is: 26.967994498529684
    The cosine similarity between : netmail and : email is: 84.51542547285166
    The cosine similarity between : email and : user is: 22.360679774997898
    The cosine similarity between : mail and : user is: 0.0
    The cosine similarity between : address and : user is: 60.30226891555272
    The cosine similarity between : netmail and : user is: 18.89822365046136
    The cosine similarity between : email and : this is: 22.360679774997898
    The cosine similarity between : mail and : this is: 25.0
    The cosine similarity between : address and : this is: 30.15113445777636
    The cosine similarity between : netmail and : this is: 37.79644730092272
    The cosine similarity between : email and : email is: 100.0
    The cosine similarity between : mail and : email is: 89.44271909999159
    The cosine similarity between : address and : email is: 26.967994498529684
    The cosine similarity between : netmail and : email is: 84.51542547285166
    The cosine similarity between : email and : address is: 26.967994498529684
    The cosine similarity between : mail and : address is: 15.07556722888818
    The cosine similarity between : address and : address is: 100.0
    The cosine similarity between : netmail and : address is: 22.79211529192759
    The cosine similarity between : email and : customer is: 31.62277660168379
    The cosine similarity between : mail and : customer is: 17.677669529663685
    The cosine similarity between : address and : customer is: 42.640143271122085
    The cosine similarity between : netmail and : customer is: 40.08918628686365
    

    Note: the threshold check is commented out in the code. Uncomment it if you only want the word pairs whose similarity exceeds a certain threshold, e.g. 80%.
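
    The threshold filtering described in the note could be sketched roughly as follows. This is a minimal, self-contained variant rather than the code above: char_cosine is a hypothetical helper that folds word2vec and cosdis into one function, and deduplicating list_A is my own assumption so that the repeated 'email' is reported only once.

```python
from collections import Counter
from math import sqrt


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # Dot product over the characters the two words share
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = sqrt(sum(n * n for n in ca.values())) * sqrt(sum(n * n for n in cb.values()))
    # An empty string has a zero-length vector; treat it as not similar
    return dot / norm if norm else 0.0


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
threshold = 0.80

matches = []
# dict.fromkeys removes duplicates from list_A while preserving order
for key in dict.fromkeys(list_A):
    for word in list_B:
        score = char_cosine(word, key)
        if score > threshold:
            matches.append((word, key, score))

for word, key, score in matches:
    print("{} ~ {}: {:.1f}%".format(word, key, score * 100))
```

    With the two example lists, only email, mail and netmail (against email) and address (against address) pass the 80% threshold.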

    EDIT:

    OP: but what I want to do is not a word-by-word comparison, but a list-by-list one

    Using Counter and math:

    from collections import Counter
    import math
    
    counterA = Counter(list_A)
    counterB = Counter(list_B)
    
    
    def counter_cosine_similarity(c1, c2):
        terms = set(c1).union(c2)
        dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
        magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
        magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
        return dotprod / (magA * magB)
    
    print(counter_cosine_similarity(counterA, counterB) * 100)
    

    OUTPUT:

    53.03300858899106
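
    One edge case the function above does not handle is an empty list: both magnitudes must be non-zero, otherwise the division fails. A guarded variant might look like the sketch below. safe_counter_cosine is a hypothetical name, and returning 0.0 for an empty list is an assumption, not something the question specifies.

```python
from collections import Counter
import math


def safe_counter_cosine(list_1, list_2):
    """Cosine similarity of two word lists; returns 0.0 if either list is empty."""
    c1, c2 = Counter(list_1), Counter(list_2)
    # Only words present in both lists contribute to the dot product
    dotprod = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    mag1 = math.sqrt(sum(n * n for n in c1.values()))
    mag2 = math.sqrt(sum(n * n for n in c2.values()))
    if mag1 == 0 or mag2 == 0:
        # Assumption: an empty list is treated as similar to nothing
        return 0.0
    return dotprod / (mag1 * mag2)


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

print(safe_counter_cosine(list_A, list_B) * 100)
print(safe_counter_cosine(list_A, []))
```

    For the two example lists this gives the same value as counter_cosine_similarity, since the shared words are email (2 x 1) and address (1 x 1), and the magnitudes are sqrt(8) and 2.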
    
  • 2021-01-03 15:48

    You can use scikit-learn (or another NLP library) to accomplish this. The example below uses CountVectorizer; for more sophisticated analysis of documents, TfidfVectorizer may be preferable.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    def vect_cos(vect, test_list):
        """ Vectorise text and compute the cosine similarity """
        # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
        query_0 = vect.transform([' '.join(vect.get_feature_names_out())])
        query_1 = vect.transform(test_list)
        cos_sim = cosine_similarity(query_0, query_1)  # accepts sparse matrices directly
        return query_1, np.round(cos_sim.squeeze(), 3)
    
    # Train the vectorizer
    vocab=['email','user','this','email','address','customer']
    vectoriser = CountVectorizer().fit(vocab)
    vectoriser.vocabulary_ # show the word-matrix position pairs
    
    # Analyse  list_1
    list_1 = ['email','mail','address','netmail']
    list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
    
    # Analyse list_2
    list_2 = ['address','ip','network']
    list_2_vect, list_2_cos = vect_cos(vectoriser, [' '.join(list_2)])
    
    print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
    print('\nThe cosine similarity for the second list is {}.'.format(list_2_cos))
    

    Output

    The cosine similarity for the first list is 0.632.
    
    The cosine similarity for the second list is 0.447.
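
    To make explicit what vect_cos measures: it compares the counts of the test list, restricted to the trained vocabulary, against a vector that holds a 1 for every vocabulary word. The standard-library sketch below reproduces that arithmetic under the assumption of plain whole-word tokens (none of scikit-learn's tokenisation or lowercasing); vocab_cosine is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter
from math import sqrt


def vocab_cosine(vocab, words):
    """Cosine similarity between an all-ones vocabulary vector and the
    in-vocabulary word counts of `words`, rounded to 3 decimals."""
    features = set(vocab)
    # Words outside the vocabulary are ignored, as a fitted vectorizer would do
    counts = Counter(w for w in words if w in features)
    dot = sum(counts.values())  # every vocabulary weight is 1
    norm = sqrt(len(features)) * sqrt(sum(n * n for n in counts.values()))
    return round(dot / norm, 3) if norm else 0.0


vocab = ['email', 'user', 'this', 'email', 'address', 'customer']

print(vocab_cosine(vocab, ['email', 'mail', 'address', 'netmail']))
print(vocab_cosine(vocab, ['address', 'ip', 'network']))
```

    For the first list, two of the five vocabulary words match, giving 2 / (sqrt(5) * sqrt(2)); for the second, only address matches, giving 1 / sqrt(5).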
    

    Edit

    If you want to calculate the cosine similarity between "email" and any other list of strings, train the vectoriser on "email" alone and then analyse the other documents.

    # Train the vectorizer
    vocab=['email']
    vectoriser = CountVectorizer().fit(vocab)
    
    # Analyse  list_1
    list_1 =['email','mail','address','netmail']
    list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
    print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
    

    Output

    The cosine similarity for the first list is 1.0.
    