I want to calculate the similarity between two lists of words, for example:
['email','user','this','email','address','customer']
is
Since you haven't shown a clear expected output, here is my best shot:
list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
For the two lists above, we compute the cosine similarity between every word in list_B and every word in list_A (for example, email from list_B against each element of list_A):
from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))
    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity we have
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
threshold = 0.80     # if needed
for key in list_A:
    for word in list_B:
        try:
            # print(key)
            # print(word)
            res = cosdis(word2vec(word), word2vec(key))
            # print(res)
            print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
            # if res > threshold:
            #     print("Found a word with cosine similarity > 80% : {} with original word: {}".format(word, key))
        except IndexError:
            pass
OUTPUT:
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
Note: I have also left the threshold part commented out in the code, in case you only want the word pairs whose similarity exceeds a certain threshold, e.g. 80%.
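As a rough sketch of what enabling that threshold could look like, the snippet below collects only the pairs whose character-level cosine similarity exceeds the cutoff. The names char_cosine and similar_pairs are made up for this illustration and are not part of the code above; the similarity is the same character-count measure, written compactly.

```python
from collections import Counter
from math import sqrt


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # only characters present in both words contribute to the dot product
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = sqrt(sum(n * n for n in ca.values())) * sqrt(sum(n * n for n in cb.values()))
    # an empty word has zero norm; treat it as similar to nothing
    return dot / norm if norm else 0.0


def similar_pairs(words_a, words_b, threshold=0.80):
    """Return (word_a, word_b, score) for every pair scoring above threshold."""
    pairs = []
    for a in words_a:
        for b in words_b:
            score = char_cosine(a, b)
            if score > threshold:
                pairs.append((a, b, score))
    return pairs


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

for a, b, score in similar_pairs(list_A, list_B):
    print("{} ~ {}: {:.1f}%".format(a, b, score * 100))
```

Note that the threshold is compared against the raw score in the range 0 to 1, not against the percentage that is printed.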
EDIT:
OP: but what I want to do exactly is not a word-by-word comparison but a list-by-list one
Using Counter and math:
from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)
OUTPUT:
53.03300858899106
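The 53.03 figure can be checked by hand. Treating each list as a vector of word counts (A for list_A, B for list_B), only email and address occur in both lists, and email occurs twice in list_A:

```latex
\[
A \cdot B = \underbrace{2 \cdot 1}_{\text{email}} + \underbrace{1 \cdot 1}_{\text{address}} = 3,
\qquad
\|A\| = \sqrt{2^2 + 1^2 + 1^2 + 1^2 + 1^2} = \sqrt{8},
\qquad
\|B\| = \sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2
\]
\[
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{3}{2\sqrt{8}} \approx 0.5303
\]
```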
You can leverage the power of Scikit-Learn (or other NLP) libraries to accomplish this. The example below uses CountVectorizer, but for more sophisticated analysis of documents it might be preferable to use the TFIDF vectorizer instead.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vect_cos(vect, test_list):
    """ Vectorise text and compute the cosine similarity """
    # one "document" containing every vocabulary term exactly once
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    query_0 = vect.transform([' '.join(vect.get_feature_names_out())])
    query_1 = vect.transform(test_list)
    # cosine_similarity accepts sparse matrices directly, so no dense conversion is needed
    cos_sim = cosine_similarity(query_0, query_1)
    return query_1, np.round(cos_sim.squeeze(), 3)

# Train the vectorizer
vocab = ['email','user','this','email','address','customer']
vectoriser = CountVectorizer().fit(vocab)
vectoriser.vocabulary_   # show the word-matrix position pairs

# Analyse list_1
list_1 = ['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])

# Analyse list_2
list_2 = ['address','ip','network']
list_2_vect, list_2_cos = vect_cos(vectoriser, [' '.join(list_2)])

print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
print('\nThe cosine similarity for the second list is {}.'.format(list_2_cos))
Output
The cosine similarity for the first list is 0.632.
The cosine similarity for the second list is 0.447.
If you want to calculate the cosine similarity between "email" and any other list of strings, train the vectoriser on "email" alone and then analyse the other documents.
# Train the vectorizer
vocab=['email']
vectoriser = CountVectorizer().fit(vocab)
# Analyse list_1
list_1 =['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
Output
The cosine similarity for the first list is 1.0.
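To see where these numbers come from, here is a dependency-free sketch of what the approach above appears to compute: the test list is reduced to counts of the vocabulary terms only and compared with a vector holding one count per vocabulary term. The function name vocab_cosine is hypothetical, and the sketch skips CountVectorizer's tokenisation (lowercasing, dropping single-character tokens), so it only matches for simple word lists like these.

```python
from collections import Counter
from math import sqrt


def vocab_cosine(vocab, words):
    """Cosine similarity between an all-ones vocabulary vector and the
    counts of `words` restricted to that vocabulary."""
    terms = set(vocab)
    # words outside the fitted vocabulary are ignored entirely
    counts = Counter(w for w in words if w in terms)
    # the vocabulary vector is all ones, so the dot product is just the sum of counts
    dot = sum(counts[t] for t in terms)
    norm = sqrt(len(terms)) * sqrt(sum(counts[t] ** 2 for t in terms))
    # no overlap with the vocabulary gives a zero vector; report 0.0
    return dot / norm if norm else 0.0


vocab = ['email', 'user', 'this', 'email', 'address', 'customer']
print(round(vocab_cosine(vocab, ['email', 'mail', 'address', 'netmail']), 3))  # 0.632
print(round(vocab_cosine(vocab, ['address', 'ip', 'network']), 3))             # 0.447

# With a one-term vocabulary, any list containing that term scores 1.0,
# no matter what else it contains.
print(vocab_cosine(['email'], ['email', 'mail', 'address', 'netmail']))        # 1.0
print(vocab_cosine(['email'], ['email', 'email', 'spam']))                     # 1.0
```

This is why the last result is exactly 1.0: words that were not seen during fitting do not count against the score, so a narrow vocabulary makes very different lists look identical.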