I am looking for a proper solution to this question. It has been asked many times before and I didn't find a single answer that suited me. I need to use a corpus in NLTK to detect whether a word is a real English word.
NLTK includes some corpora that are nothing more than word lists. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or misspelt words in a text, as shown in the following function:
    import nltk

    def unusual_words(text):
        text_vocab = set(w.lower() for w in text.split() if w.isalpha())
        english_vocab = set(w.lower() for w in nltk.corpus.words.words())
        unusual = text_vocab - english_vocab
        return sorted(unusual)
In this case you can check the membership of your word in english_vocab:
>>> import nltk
>>> english_vocab = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_vocab
True
>>> 'this' in english_vocab
True
>>> 'nothing' in english_vocab
True
>>> 'nothingg' in english_vocab
False
>>> 'corpus' in english_vocab
True
>>> 'Terminology'.lower() in english_vocab
True
>>> 'sorted' in english_vocab
True
Based on my experience, I found two options with NLTK:
1:

    from nltk.corpus import words

    unknown_word = []
    if token not in words.words():
        unknown_word.append(token)
2:

    from nltk.corpus import wordnet

    unknown_word = []
    if len(wordnet.synsets(token)) == 0:
        unknown_word.append(token)
Option 2 performs better and captures more relevant words, so I would recommend it. Note also that option 1 as written re-scans the list returned by words.words() on every check; converting it to a set once makes the membership test much faster.
I tried the above approach, but it failed for many words that should exist, so I tried WordNet instead. I think it has a more comprehensive vocabulary:
    from nltk.corpus import wordnet

    if wordnet.synsets(word):
        # the word is known to WordNet; do something
        pass
    else:
        # unknown word; do something else
        pass
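If neither resource alone is sufficient, the two checks can be combined: accept a token if it appears in the Words Corpus or has at least one WordNet synset. Below is a hedged sketch of that idea; make_word_checker and is_english_word are my own helper names, and the function takes the two lookups as parameters so the example runs with tiny stand-ins instead of the real NLTK data:

```python
def make_word_checker(vocab, synset_lookup):
    """Return a predicate accepting a word found in either resource.

    `vocab` is a set of known lowercase words (e.g. built from
    nltk.corpus.words.words()); `synset_lookup` maps a word to a
    possibly-empty list of senses, like nltk.corpus.wordnet.synsets.
    """
    def is_english_word(token):
        token = token.lower()
        return token in vocab or bool(synset_lookup(token))
    return is_english_word

# With NLTK (assumes the `words` and `wordnet` data are downloaded):
#   from nltk.corpus import words, wordnet
#   checker = make_word_checker(set(w.lower() for w in words.words()),
#                               wordnet.synsets)

# Tiny stand-ins so the sketch runs without NLTK data:
vocab = {"corpus", "terminology"}
fake_synsets = {"nothing": ["nothing.n.01"]}
checker = make_word_checker(vocab, lambda w: fake_synsets.get(w, []))
print(checker("Corpus"))    # True  (word-list hit)
print(checker("nothing"))   # True  (synset hit)
print(checker("nothingg"))  # False
```

Passing the lookups in also makes it easy to swap in a different word list or a cached synset function later without touching the predicate itself.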