Statistical sentence suggestion model like spell checking


What you want to build is an N-gram model, which consists of computing the probability of each word given the n-1 words that precede it.

You can use the NLTK text corpora to train your model, or tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
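For example, tokenizing a toy corpus looks like this (this requires NLTK's punkt tokenizer data to be installed):

import nltk

text = "The cat is cute. He jumps and he is happy."
# Split the text into sentences, then each sentence into word tokens.
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))
# ['The', 'cat', 'is', 'cute', '.']
# ['He', 'jumps', 'and', 'he', 'is', 'happy', '.']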

You can consider a 2-gram model (a Markov model):

What is the probability of "kitten" following "cute"?

...or a 3-gram model:

What is the probability of "kitten" following "the cute"?

etc.
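Concretely, the maximum-likelihood estimate of a 2-gram probability is just a ratio of counts:

P(kitten | cute) = count("cute kitten") / count("cute")

So if "cute" occurs 50 times in the corpus and is followed by "kitten" 5 times, then P(kitten | cute) = 5/50 = 0.1.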

Obviously, training an (n+1)-gram model is costlier than an n-gram model, since the number of contexts to count grows with n.

Instead of considering words alone, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags with nltk.pos_tag(tokens)), as sketched below.
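A minimal sketch of building such pairs (the lowercasing is my choice for illustration; nltk.pos_tag requires the averaged perceptron tagger data):

import nltk

tokens = nltk.word_tokenize("The cat is cute.")
# nltk.pos_tag returns (word, tag) pairs using the Penn Treebank tagset.
pairs = [(word.lower(), tag) for word, tag in nltk.pos_tag(tokens)]
print(pairs)
# [('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('cute', 'JJ'), ('.', '.')]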

You can also try using lemmas instead of the words themselves.
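For example, with NLTK's WordNet lemmatizer (it assumes nouns unless you pass a part of speech, so it combines well with the POS tags above):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))            # 'cat'
print(lemmatizer.lemmatize("jumps", pos="v"))  # 'jump'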

Here are some interesting lectures on N-gram modelling:

  1. Introduction to N-grams
  2. Estimating N-gram Probabilities

Here is a simple, short, and unoptimized 2-gram example:

from collections import defaultdict
import math
import nltk

# ngram[w][w2] first holds the count of the bigram (w, w2),
# then gets replaced by the log10 probability of w2 given w.
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

# Count bigrams sentence by sentence (note: in Python 3 the tokens
# must be a list, not a map iterator, for tokens[1:] to work).
for sentence in nltk.sent_tokenize(corpus):
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Normalize the counts into log10 probabilities.
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}
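Once trained, the table can be queried to rank candidate next words; suggest below is a small helper I'm adding for illustration, continuing from the code above:

# Return up to n most probable followers of `word`, best first.
def suggest(word, n=3):
    followers = ngram.get(word.lower(), {})
    return sorted(followers, key=followers.get, reverse=True)[:n]

print(suggest("is"))  # e.g. ['cute', 'happy'] on the toy corpus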