I am looking for a proper solution to this question. It has been asked many times before and I didn't find a single answer that suited me. I need to use a corpus in NLTK to detect whether a word is a real English word.
NLTK includes some corpora that are nothing more than word lists. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or misspelt words in a text, as shown in the following function:
    import nltk

    def unusual_words(text):
        text_vocab = set(w.lower() for w in text.split() if w.isalpha())
        english_vocab = set(w.lower() for w in nltk.corpus.words.words())
        unusual = text_vocab - english_vocab
        return sorted(unusual)
In this case you can check the membership of your word in english_vocab:
>>> import nltk
>>> english_vocab = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_vocab
True
>>> 'this' in english_vocab
True
>>> 'nothing' in english_vocab
True
>>> 'nothingg' in english_vocab
False
>>> 'corpus' in english_vocab
True
>>> 'Terminology'.lower() in english_vocab
True
>>> 'sorted' in english_vocab
True
Based on my experience, I found two options with NLTK:
1:

    from nltk.corpus import words

    unknown_word = []
    if token not in words.words():
        unknown_word.append(token)
2:

    from nltk.corpus import wordnet

    unknown_word = []
    if len(wordnet.synsets(token)) == 0:
        unknown_word.append(token)
Option 2 performs better and captures more relevant words, so I would recommend it. Note also that option 1 as written re-scans the list returned by words.words() on every check; converting it to a set once makes the membership test much faster.
I tried the above approach, but it failed for many words that should exist, so I tried WordNet instead. I think it has a more comprehensive vocabulary:
    from nltk.corpus import wordnet

    if wordnet.synsets(word):
        # the word is known to WordNet; do something
        pass
    else:
        # unknown word; do something else
        pass
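If neither resource alone is sufficient, the two checks can be combined: accept a token if it appears in the Words Corpus or has at least one WordNet synset. Below is a hedged sketch of that idea; make_word_checker and is_english_word are my own helper names, and the function takes the two lookups as parameters so the example runs with tiny stand-ins instead of the real NLTK data:

```python
def make_word_checker(vocab, synset_lookup):
    """Return a predicate accepting a word found in either resource.

    `vocab` is a set of known lowercase words (e.g. built from
    nltk.corpus.words.words()); `synset_lookup` maps a word to a
    possibly-empty list of senses, like nltk.corpus.wordnet.synsets.
    """
    def is_english_word(token):
        token = token.lower()
        return token in vocab or bool(synset_lookup(token))
    return is_english_word

# With NLTK (assumes the `words` and `wordnet` data are downloaded):
#   from nltk.corpus import words, wordnet
#   checker = make_word_checker(set(w.lower() for w in words.words()),
#                               wordnet.synsets)

# Tiny stand-ins so the sketch runs without NLTK data:
vocab = {"corpus", "terminology"}
fake_synsets = {"nothing": ["nothing.n.01"]}
checker = make_word_checker(vocab, lambda w: fake_synsets.get(w, []))
print(checker("Corpus"))    # True  (word-list hit)
print(checker("nothing"))   # True  (synset hit)
print(checker("nothingg"))  # False
```

Passing the lookups in also makes it easy to swap in a different word list or a cached synset function later without touching the predicate itself.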