Question
I need to find the similarity between two lists of short texts in Python. Each text is 1-4 words long, and each list can contain about 10K items. I couldn't find an efficient way to do this in spaCy; maybe other packages can do it? I assume each word is represented by a vector (e.g. 300-d), but other options are fine too. The task could be done in a loop, but there must be a more efficient way. It seems like a fit for TensorFlow, PyTorch, and similar packages, but I'm not familiar with the details of those packages.
Answer 1:
I think your question is ambiguous: you might mean a single similarity score between the average of list 1 and the average of list 2. I'm assuming instead that you want a similarity score for every combination of items from the two lists. For 10K items per list, that produces 10K × 10K = 100M similarity scores.
import spacy

# The 'en' shortcut link was removed in spaCy 3; load a model that ships
# with word vectors so that .similarity() is meaningful:
spacyModel = spacy.load("en_core_web_md")
list1 = ["hello, example 1", "right, second example"]
list2 = ["hello, example 1 in the second list", "And now for something completely different"]
list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]
# Note the index order: rows correspond to list2, columns to list1.
similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]
print(similarityMatrix)
# Example output (exact values depend on the model):
# [[0.8537950408055295, 0.8852732956832498], [0.5802435148988874, 0.7643245611465626]]
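For 10K-item lists, the nested Python loop above makes 100M individual `.similarity()` calls, which is slow. Since spaCy's document similarity is just the cosine of the averaged word vectors, you can stack each document's `doc.vector` into a matrix and compute all pairwise cosine similarities with a single matrix product. Below is a minimal NumPy sketch of that idea; it uses random 300-d vectors as stand-ins for the real `doc.vector` rows (the function name `cosine_similarity_matrix` is my own, not a library API):

```python
import numpy as np

def cosine_similarity_matrix(vecs1, vecs2):
    """All pairwise cosine similarities between rows of vecs1 and vecs2."""
    # Normalize each row to unit length, then one matrix product gives
    # every pairwise dot product (= cosine similarity for unit vectors).
    a = vecs1 / np.linalg.norm(vecs1, axis=1, keepdims=True)
    b = vecs2 / np.linalg.norm(vecs2, axis=1, keepdims=True)
    return a @ b.T  # shape: (len(vecs1), len(vecs2))

# Stand-in "document vectors" (300-d, as assumed in the question).
rng = np.random.default_rng(0)
vecs1 = rng.normal(size=(3, 300))
vecs2 = rng.normal(size=(5, 300))

sim = cosine_similarity_matrix(vecs1, vecs2)
print(sim.shape)  # (3, 5)
```

With real spaCy docs you would build the inputs as `vecs1 = np.array([d.vector for d in list1SpacyDocs])` (and likewise for the second list); for 10K x 10K this runs in seconds rather than hours, at the cost of holding the 100M-entry result matrix in memory.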
Source: https://stackoverflow.com/questions/53309192/similarity-between-two-lists-of-documents