Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence by using the most common synonym of each word in it.
If a word use
The other answer shows you how to use synonyms:
>>> wn.synsets('small')
[Synset('small.n.01'), Synset('small.n.02'), Synset('small.a.01'), Synset('minor.s.10'), Synset('little.s.03'), Synset('small.s.04'), Synset('humble.s.01'), Synset('little.s.07'), Synset('little.s.05'), Synset('small.s.08'), Synset('modest.s.02'), Synset('belittled.s.01'), Synset('small.r.01')]
You now know how to get all the synonyms for a word. That's not the hard part. The hard part is determining which synonym is the most common. That question is highly domain-dependent. Most common synonym where? In literature? In common vernacular? In technical speak?
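For completeness, here's a minimal sketch of pulling the actual synonym strings (lemma names) out of those synsets, assuming the usual from nltk.corpus import wordnet as wn import:

from nltk.corpus import wordnet as wn

# Gather every lemma name (i.e. synonym string) from all synsets of "small"
synonyms = set()
for synset in wn.synsets('small'):
    synonyms.update(synset.lemma_names())

print(synonyms)  # e.g. {'small', 'little', 'minor', 'modest', 'humble', ...}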
Like you, I wanted to get an idea of how the English language is actually used. I downloaded 15,000 entire books from Project Gutenberg and computed word and letter-pair frequencies across all of them. After ingesting such a large corpus, I could see which words were used most commonly. Like I said above, though, it will depend on what you're trying to process. If it's something like Twitter posts, try ingesting a ton of tweets. Learn from what you will eventually have to process.
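If you want to try the same idea on a small scale, here's a sketch using the handful of Gutenberg texts bundled with NLTK as a stand-in for a real, large corpus (it assumes the gutenberg corpus data has been downloaded):

import nltk
from nltk.corpus import gutenberg  # NLTK's small bundled sample, not 15,000 books

# Word frequencies over every text in the corpus, lowercased
freqs = nltk.FreqDist(w.lower() for w in gutenberg.words())

print(freqs.most_common(10))            # the most common words overall
print(freqs['small'], freqs['little'])  # compare two candidate synonyms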
Synonyms are a huge and open area of work in natural language processing.
In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.
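A minimal sketch of that dictionary approach, keyed by (word, part of speech) so the two uses of "valued" don't collide; the synonym groups below are invented purely for illustration:

# The synonym groups here are made up for illustration only.
allowed_synonyms = {
    ("dear", "ADJ"): {"valued", "beloved"},
    ("valued", "ADJ"): {"dear", "prized"},
    ("valued", "VERB"): {"appraised", "assessed"},
}

def substitutes(word, pos):
    """Look up the allowed synonyms for a word used as a given part of speech."""
    return allowed_synonyms.get((word, pos), set())

print(substitutes("valued", "ADJ"))   # e.g. {'dear', 'prized'}
print(substitutes("valued", "VERB"))  # e.g. {'appraised', 'assessed'}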
Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
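One rough way to check such bigram intuitions is to count them in a tagged corpus; a sketch using the Brown corpus (counts will vary with the corpus you choose):

import nltk
from nltk.corpus import brown

# Count bigrams over the lowercased Brown corpus
bigrams = nltk.FreqDist(nltk.bigrams(w.lower() for w in brown.words()))

print(bigrams[('dear', 'friend')], bigrams[('valued', 'friend')])
print(bigrams[('dear', 'customer')], bigrams[('valued', 'customer')])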
Another method might be to just look at everything and see what words appear in similar contexts. You need a huge corpus for this to be effective though, and you have to decide how large a window of n-grams you want to use (a bigram context? A 20-gram context?).
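A crude sketch of that context-counting idea, using a +/- 2-word window over the Brown corpus (far too small a corpus to do this well, but it shows the mechanics):

import nltk
from nltk.corpus import brown

def context_profile(target, window=2):
    """Count which words appear within +/- `window` positions of `target`."""
    words = [w.lower() for w in brown.words()]
    profile = nltk.FreqDist()
    for i, w in enumerate(words):
        if w == target:
            profile.update(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
    return profile

# Words with similar profiles tend to be usable in similar contexts
print(context_profile("small").most_common(10))
print(context_profile("little").most_common(10))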
I recommend you take a look at applications of WordNet (https://wordnet.princeton.edu/), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!
Edit: I should have included this link to an older question as well:
How to get synonyms from nltk WordNet Python
And the NLTK documentation on its interface with WordNet:
http://www.nltk.org/howto/wordnet.html
I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like above, though.
Synonyms are tricky, but if you are starting out with a synset from WordNet and you simply want to choose the most common member of the set, it's pretty straightforward: just build your own frequency list from a corpus and look up each member of the synset to pick the maximum.
NLTK will let you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:
import nltk
from nltk.corpus import brown

freqs = nltk.FreqDist(w.lower() for w in brown.words())
You can then look up the frequency of a word like this:
>>> print(freqs["valued"])
14
Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (WordNet provides n, v, a, and r, i.e. noun, verb, adjective and adverb), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.
>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
brown.tagged_words(tagset="universal"))
>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45