Question
Is there a way to equate strings in Python based on their meaning, even when they are not textually similar? For example,
- temp. Max
- maximum ambient temperature
I've tried using fuzzywuzzy and difflib, and although they are generally good at this kind of token matching, they also produce false positives when I threshold the outputs over a large number of strings. Is there some other method using NLP or tokenization that I'm missing here?
Edit: The answer provided by A CO does solve the problem mentioned above, but is there any way to match specific substrings using word2vec from a key? E.g. Key = "max temp", Sent = "the maximum ambient temperature expected tomorrow in California is 34 degrees."
So here I'd like to get the substring "maximum ambient temperature". Any tips on that?
Answer 1:
As you say, packages like fuzzywuzzy or difflib will be limited because they compute similarities based on the spelling of the strings, not on their meaning.
You could use word embeddings. Word embeddings are vector representations of words, computed in a way that captures their meaning, to a certain extent.
There are different methods for generating word embeddings, but the most common one is to train a neural network on one - or a set - of word-level NLP tasks, and use the penultimate layer as a representation of the word. This way, the final representation of the word is supposed to have accumulated enough information to complete the task, and this information can be interpreted as an approximation of the meaning of the word. I recommend that you read a bit about Word2vec, which is the method that made word embeddings popular, as it is simple to understand but representative of what word embeddings are. Here is a good introductory article. The similarity between two words can then be computed, usually as the cosine similarity between their vector representations.
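Concretely, the cosine similarity is the cosine of the angle between the two vectors. A minimal sketch in plain numpy (the toy vectors below are made up for illustration; real embeddings typically have 100-300 dimensions):
import numpy as np
def cosine_similarity(u, v):
    # Dot product divided by the product of the norms:
    # 1.0 means same direction, 0.0 means orthogonal (unrelated)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# Toy 3-dimensional "embeddings", purely illustrative
vec_a = np.array([0.8, 0.1, 0.3])
vec_b = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(vec_a, vec_b))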
Of course, you don't need to train word embeddings yourself, as plenty of pretrained vectors are available (GloVe, word2vec, fastText, spaCy...). Which embeddings you use will depend on the observed performance and on your understanding of how well suited they are to the task you want to perform. Here is an example with spaCy's word vectors, where the sentence vector is computed by averaging the word vectors:
# Importing spaCy and fuzzywuzzy
import spacy
from fuzzywuzzy import fuzz
# Loading spaCy's large English model, which includes word vectors
nlp_model = spacy.load('en_core_web_lg')
s1 = "temp. Max"
s2 = "maximum ambient temperature"
s3 = "the blue cat"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors (the document or sentence vector is the average of the word vectors it contains)
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s2, doc1.similarity(doc2)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, doc1.similarity(doc3)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, doc2.similarity(doc3)))
# Fuzzy string matching (character-based, operates on the raw strings)
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s2, fuzz.ratio(s1, s2)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, fuzz.ratio(s1, s3)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, fuzz.ratio(s2, s3)))
This will print:
>>> Document vectors similarity between 'temp. Max' and 'maximum ambient temperature' is: 0.6432
>>> Document vectors similarity between 'temp. Max' and 'the blue cat' is: 0.3810
>>> Document vectors similarity between 'maximum ambient temperature' and 'the blue cat' is: 0.3117
>>> Character ratio similarity between 'temp. Max' and 'maximum ambient temperature' is: 28.0000
>>> Character ratio similarity between 'temp. Max' and 'the blue cat' is: 38.0000
>>> Character ratio similarity between 'maximum ambient temperature' and 'the blue cat' is: 21.0000
As you can see, the similarity computed with word vectors better reflects the similarity in the meaning of the documents.
However, this is really just a starting point, and there are plenty of caveats. Here is a list of some things you should watch out for:
- Word (and document) vectors do not represent the meaning of the word (or document) per se; they are a way of approximating it. That implies that they will hit a limitation at some point, and you cannot take for granted that they will let you differentiate all the nuances of the language.
- What we expect to be the "similarity in meaning" between two words/sentences varies according to the task we have. As an example, what would be the "ideal" similarity between "maximum temperature" and "minimum temperature"? High, because they refer to an extreme state of the same concept, or low, because they refer to opposite states of the same concept? With word embeddings, you will usually get a high similarity for these sentences, because "maximum" and "minimum" often appear in the same contexts and therefore end up with similar vectors (see the short sketch after the example below).
- In the example given, 0.6432 is still not a very high similarity. This probably comes from the use of abbreviated words in the example. Depending on how the word embeddings were generated, they might not handle abbreviations well. In general, it is better to feed syntactically and grammatically correct inputs to NLP algorithms. Depending on what your dataset looks like and your knowledge of it, some cleaning beforehand can be very helpful. Here is an example with grammatically correct sentences that highlights the similarity in meaning more clearly:
s1 = "The president has given a good speech"
s2 = "Our representative has made a nice presentation"
s3 = "The president ate macaronis with cheese"
doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)
# Word vectors
print(doc1.similarity(doc2))
>>> 0.8779
print(doc1.similarity(doc3))
>>> 0.6131
print(doc2.similarity(doc3))
>>> 0.5771
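Coming back to the "maximum"/"minimum" caveat above, here is a short sketch reusing the nlp_model loaded earlier (the exact score depends on the model and version, but you can expect it to be high):
# Antonyms often receive a high similarity because they appear in similar contexts
doc_max = nlp_model("maximum temperature")
doc_min = nlp_model("minimum temperature")
print(doc_max.similarity(doc_min))  # expect a high score despite the opposite meanings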
Anyway, word embeddings are probably what you are looking for, but you need to take the time to learn about them. I would recommend that you read about word (and sentence, and document) embeddings and that you play around a bit with different pretrained vectors to get a better understanding of how they can be used for your task.
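Regarding the edit about extracting a matching substring for a key: spaCy has no built-in function for this, but one possible approach is to score every short span of the sentence against the key and keep the best one. Below is a rough sketch under that assumption (best_matching_span is a hypothetical helper, not a library function):
def best_matching_span(key, sentence, nlp_model, min_len=1, max_len=5):
    # Compare the key's vector with every n-gram span of the sentence
    # and return the span with the highest cosine similarity
    key_doc = nlp_model(key)
    sent_doc = nlp_model(sentence)
    best_span, best_score = None, -1.0
    for n in range(min_len, max_len + 1):
        for start in range(len(sent_doc) - n + 1):
            span = sent_doc[start:start + n]
            if not span.vector_norm:  # skip spans without a vector (e.g. punctuation)
                continue
            score = key_doc.similarity(span)
            if score > best_score:
                best_span, best_score = span, score
    return best_span, best_score

span, score = best_matching_span("max temp",
    "the maximum ambient temperature expected tomorrow in California is 34 degrees",
    nlp_model)
print(span.text, score)
Whether this returns exactly "maximum ambient temperature" will depend on the vectors used; in practice you may need to filter stop words and tune the span lengths.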
Source: https://stackoverflow.com/questions/63007807/equate-strings-based-on-meaning