Statistical sentence suggestion model like spell checking

Submitted by 此生再无相见时 on 2019-12-03 20:30:03

Question


There are already spell-checking models available that suggest correct spellings based on a corpus of known-correct words. Can the granularity be increased from letters to words so that we can have phrase suggestions as well: if an incorrect phrase is entered, the system should suggest the nearest correct phrase from a corpus of correct phrases, trained of course on a list of valid phrases?

Are there any Python libraries that already achieve this functionality? If not, how should I proceed, given an existing large gold-standard phrase corpus, to get statistically relevant suggestions?

Note: this is different from a spell checker, since a spell checker's alphabet is finite, whereas in a phrase corrector each symbol of the alphabet is itself a word and hence the alphabet is theoretically infinite; however, we can limit the vocabulary to the words in the phrase bank.


Answer 1:


What you want to build is an N-gram model, which consists in computing the probability of each word following a sequence of n-1 words.

You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
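
For instance, a minimal sketch of both options (the brown corpus and the sample string below are just illustrative choices):

import nltk
from nltk.corpus import brown

# Option 1: an NLTK corpus, already split into tokenized sentences
# (requires nltk.download('brown'))
sentences = brown.sents()

# Option 2: tokenize your own raw text
# (requires nltk.download('punkt'))
text = "The cat is cute. He jumps and he is happy."
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]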

You can consider a 2-gram model (a Markov model):

What is the probability that "kitten" follows "cute"?

...or a 3-gram model:

What is the probability that "kitten" follows "the cute"?

etc.

Obviously, training an (n+1)-gram model is costlier than training an n-gram model.
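
As a concrete sketch of those probabilities, NLTK's ConditionalFreqDist can tabulate bigram counts, and P(next | previous) is then the pair count divided by the count of the previous word (the sentence below is just an illustration):

import nltk

tokens = [t.lower() for t in nltk.word_tokenize(
    "the cute kitten sleeps and the cute cat jumps")]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

# P("kitten" | "cute") = count("cute kitten") / count("cute") = 1 / 2
print(cfd["cute"].freq("kitten"))  # 0.5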

Instead of considering words, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
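
For example, a quick sketch with NLTK's default tagger (requires nltk.download('averaged_perceptron_tagger'); the tags shown are the usual Penn Treebank output):

import nltk

tokens = nltk.word_tokenize("The cat is cute")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('cute', 'JJ')]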

You can also try considering lemmas instead of words.
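
A minimal lemmatization sketch with NLTK's WordNetLemmatizer (requires nltk.download('wordnet'); passing the part of speech improves results):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("jumps", pos="v"))    # jump
print(lemmatizer.lemmatize("kittens", pos="n"))  # kitten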

Here are some interesting lectures about N-gram modelling:

  1. Introduction to N-grams
  2. Estimating N-gram Probabilities

This is a simple, short example of (unoptimized) 2-gram code:

from collections import defaultdict
import math

import nltk

# ngram[token][next_token] first stores bigram counts, then is
# overwritten with log10 conditional probabilities.
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

# Count how often each token is followed by each next token.
for sentence in nltk.sent_tokenize(corpus):
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Turn counts into log10 probabilities: log P(next_token | token).
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}
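
To get suggestions out of the trained table, you can rank candidate next words by their log probability. The suggest_next helper below is a hypothetical addition, not part of the original answer:

def suggest_next(word, k=3):
    # Hypothetical helper: return the k most probable next words after `word`.
    candidates = ngram.get(word.lower(), {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

print(suggest_next("is"))  # e.g. ['cute', 'happy']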


Source: https://stackoverflow.com/questions/31827756/statistical-sentence-suggestion-model-like-spell-checking
