NLTK language model (ngram): calculate the prob of a word from context


I am using Python and NLTK to build a language model as follows:

from nltk.corpus import brown
from nltk.probability         


        
4 Answers
  •  有刺的猬
    2020-12-13 21:59

    I know this question is old, but it pops up every time I google NLTK's NgramModel class. NgramModel's prob implementation is a little unintuitive, the asker is clearly confused, and as far as I can tell the existing answers aren't great. Since I don't use NgramModel often, I end up confused all over again every time. No more.

    The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method:

    def prob(self, word, context):
        """
        Evaluate the probability of this word in this context using Katz Backoff.
    
        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: list(str)
        """
    
        context = tuple(context)
        if (context + (word,) in self._ngrams) or (self._n == 1):
            return self[context].prob(word)
        else:
            return self._alpha(context) * self._backoff.prob(word, context[1:])
    

    (note: 'self[context].prob(word)' is equivalent to 'self._model[context].prob(word)')
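
    In words: if the full n-gram was seen in training, ask the conditional distribution for that context directly; otherwise multiply the leftover probability mass reserved for that context (alpha) by the probability the (n-1)-gram backoff model assigns to the word, after dropping the leftmost context token. A purely illustrative trace for a trigram model (hypothetical names, not actual NgramModel output):

    # hypothetical: lm3.prob('spain', ['rain', 'in']) on a trigram model
    # is ('rain', 'in', 'spain') in self._ngrams?
    #   yes -> return self[('rain', 'in')].prob('spain')
    #   no  -> return self._alpha(('rain', 'in')) * self._backoff.prob('spain', ('in',))
    #          (the backoff model is a bigram model, and the recursion continues)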

    Okay. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:

    for sent in train:
        for ngram in ingrams(chain(self._lpad, sent, self._rpad), n):
            self._ngrams.add(ngram)
            context = tuple(ngram[:-1])
            token = ngram[-1]
            cfd[context].inc(token)
    
    if not estimator_args and not estimator_kwargs:
        self._model = ConditionalProbDist(cfd, estimator, len(cfd))
    else:
        self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)
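
    For reference, ingrams just yields sliding n-tuples over the padded token stream, and each stored context is that tuple minus its last element. A quick illustration using nltk.util.ngrams, which is what the same helper is called in current NLTK:

    >>> from nltk.util import ngrams
    >>> list(ngrams('the rain in spain'.split(), 3))
    [('the', 'rain', 'in'), ('rain', 'in', 'spain')]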
    

    Alright. The constructor creates a conditional probability distribution (self._model) out of a conditional frequency distribution whose "contexts" are tuples of unigrams. This tells us 'context' should not be a string or a list containing a single multi-word string. 'context' MUST be an iterable of unigrams. In fact, the requirement is a little stricter: these tuples or lists must be of size n-1. Think of it this way: if you told it to be a trigram model, you'd better give it the appropriate context for trigrams.

    Let's see this in action with a simpler example:

    >>> import nltk
    >>> obs = 'the rain in spain falls mainly in the plains'.split()
    >>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
    >>> lm.prob('rain', 'the') #wrong
    0.0
    >>> lm.prob('rain', ['the']) #right
    0.5
    >>> lm.prob('spain', 'rain in') #wrong
    0.0
    >>> lm.prob('spain', ['rain in']) #wrong
    '''long exception'''
    >>> lm.prob('spain', ['rain', 'in']) #right
    1.0
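
    The calls that pass a plain string as the context go wrong for a subtle reason: prob starts with context = tuple(context), and tupling a bare string splits it into characters, so you silently ask about a completely different (and unseen) context. A quick plain-Python illustration:

    >>> tuple('the')               # a bare string becomes a tuple of characters
    ('t', 'h', 'e')
    >>> tuple(['the'])             # a one-element list becomes the intended 1-tuple
    ('the',)
    >>> tuple('rain in'.split())   # split first to get proper unigram tokens
    ('rain', 'in')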
    

    (As a side note, actually trying to do anything with MLE as your estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)
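
    If you actually want usable probabilities, swap in a smoothed estimator. Here is a minimal sketch reusing obs from above with the usual Lidstone-lambda idiom from that era of NLTK; treat the exact call signature and the resulting numbers as version-dependent:

    >>> from nltk.probability import LidstoneProbDist
    >>> est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)   # gamma = 0.2, bins ignored
    >>> lm_smoothed = nltk.NgramModel(2, obs, estimator=est)
    >>> lm_smoothed.prob('rain', ['the'])   # same query as before, now against the smoothed model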

    As for the original question, I suppose my best guess at what OP wants is this:

    print(lm.prob("word", "generates a".split()))   # context passed as a list of unigram tokens, as explained above
    print(lm.prob("b", "generates a".split()))
    

    ...but there are so many misunderstandings going on here that I can't possibly tell what he was actually trying to do.
