Question
I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.
What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.
I'm pretty sure though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:
import nltk
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
prob_sum = 0
for i in kneser_ney.samples():
    prob_sum += kneser_ney.prob(i)
print(prob_sum)
The output is "41.51696428571428". Depending on the corpus size, this value grows without bound, which makes whatever prob() returns anything but a probability distribution.
Looking at the NLTK code I would say that the implementation is questionable. Maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint please? In any other case: do you know any working Python implementation? I don't really want to implement it myself.
Answer 1:
Kneser-Ney (also have a look at Chen and Goodman for a great survey of different smoothing techniques) is quite a complicated smoothing method, and only a few packages that I am aware of get it right. I am not aware of any Python implementation, but you can definitely try SRILM if you just need probabilities, etc.
- There is a good chance that your sample contains words that did not occur in the training data (a.k.a. Out-Of-Vocabulary (OOV) words), which, if not handled properly, can mess up the probabilities you get. Perhaps that is what causes the outrageously large and invalid probabilities? A sketch of one common workaround follows.
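One common way to handle this, assuming you are willing to preprocess your tokens, is to collapse rare and unseen words into a single placeholder before counting, so the smoothed model has something to fall back on. In the minimal sketch below, the replace_oov helper, the <UNK> symbol, the min_count threshold and the toy token list are all illustrative choices of mine, not anything NLTK or SRILM gives you out of the box.
import nltk
from collections import Counter

def replace_oov(tokens, vocab, unk="<UNK>"):
    # Map every token outside the chosen vocabulary to a single <UNK> symbol,
    # so unseen words at query time can be folded into that symbol as well.
    return [tok if tok in vocab else unk for tok in tokens]

# Toy training tokens; in practice these would come from a tokenizer over your corpus.
train_tokens = ("what a piece of work is man how noble in reason "
                "how infinite in faculty").split()

# Keep only words seen at least min_count times; everything else becomes <UNK>.
min_count = 2
counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c >= min_count}

trigrams = list(nltk.trigrams(replace_oov(train_tokens, vocab)))
kneser_ney = nltk.KneserNeyProbDist(nltk.FreqDist(trigrams))

# "noble" occurs fewer than min_count times, so the query is mapped onto <UNK>
# before lookup instead of falling outside the model entirely.
query = tuple(replace_oov(["how", "noble", "in"], vocab))
print("{0}: {1}".format(query, kneser_ney.prob(query)))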
Answer 2:
I think you are misunderstanding what Kneser-Ney is computing.
From Wikipedia:
The normalizing constant λ(w_{i-1}) has a value chosen carefully to make the sum of conditional probabilities p_KN(w_i | w_{i-1}) over all w_i equal to one.
Of course we're talking about bigrams here, but the same principle holds for higher-order models. Basically what this quote means is that, for a fixed context w_{i-1} (or more context for higher-order models), the probabilities of all possible w_i must add up to one. What you are doing when you add up the probabilities of all samples is mixing multiple contexts, which is why you end up with a "probability" greater than 1. If you keep the context fixed, as in the following code sample, you end up with a number <= 1.
import nltk
from nltk.util import ngrams
from nltk.corpus import gutenberg

# Build padded word trigrams over every sentence of the Gutenberg corpus.
gut_ngrams = (ngram for sent in gutenberg.sents() for ngram in ngrams(sent, 3, pad_left=True, pad_right=True, right_pad_symbol='EOS', left_pad_symbol="BOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum only the probabilities of trigrams whose context is ("I", "confess").
prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
The output, based on the NLTK Gutenberg corpus subset, is as follows.
(u'I', u'confess', u'.--'):0.00657894736842
(u'I', u'confess', u'what'):0.00657894736842
(u'I', u'confess', u'myself'):0.00657894736842
(u'I', u'confess', u'also'):0.00657894736842
(u'I', u'confess', u'there'):0.00657894736842
(u'I', u'confess', u',"'):0.0328947368421
(u'I', u'confess', u'that'):0.164473684211
(u'I', u'confess', u'"--'):0.00657894736842
(u'I', u'confess', u'it'):0.0328947368421
(u'I', u'confess', u';'):0.00657894736842
(u'I', u'confess', u','):0.269736842105
(u'I', u'confess', u'I'):0.164473684211
(u'I', u'confess', u'unto'):0.00657894736842
(u'I', u'confess', u'is'):0.00657894736842
0.723684210526
The reason why this sum (.72) is less than 1 is that the probability is calculated only over the trigrams appearing in the corpus whose first word is "I" and whose second word is "confess". The remaining .28 of the probability mass is reserved for words w_i which do not follow "I confess" in the corpus. This is the whole point of smoothing: to reallocate some probability mass from the n-grams that appear in the corpus to those that don't, so that you don't end up with a bunch of zero-probability n-grams.
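If you want to see that reserved mass explicitly, here is a small follow-up sketch reusing kneser_ney and prob_sum from the snippet above; the discount() call is, as far as I remember, how KneserNeyProbDist exposes the absolute discount (0.75 by default) that drives the hold-back.
# The mass Kneser-Ney holds back for unseen continuations of ("I", "confess")
# is whatever is left after summing the probabilities of the observed ones.
reserved = 1.0 - prob_sum
print("reserved for unseen continuations: {0}".format(reserved))  # roughly 0.276 here

# The size of the hold-back is governed by the absolute discount.
print("discount: {0}".format(kneser_ney.discount()))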
Also, doesn't the line
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
compute character trigrams rather than word trigrams? I think the text needs to be tokenized into words first to get word trigrams; a sketch of that is below.
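A minimal sketch of what that might look like, using nltk.word_tokenize (which may require a one-time nltk.download('punkt')); any other word tokenizer would work just as well:
import nltk

text = ("What a piece of work is man! how noble in reason! how infinite in faculty! "
        "in form and moving how express and admirable! in action how like an angel! "
        "in apprehension how like a god! the beauty of the world, the paragon of animals!")

# Tokenize into words first, then build word-level trigrams instead of character-level ones.
tokens = nltk.word_tokenize(text)
word_trigrams = list(nltk.trigrams(tokens))

freq_dist = nltk.FreqDist(word_trigrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Show the most frequent word trigrams, their counts, and their smoothed probabilities.
for trigram, count in freq_dist.most_common(3):
    print("{0} (count {1}): {2}".format(trigram, count, kneser_ney.prob(trigram)))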
Answer 3:
ANSWERING YOUR OTHER QUESTION:
In any other case: do you know any working Python implementation?
I just finished a Kneser-Ney implementation in Python. The code is here, and there are also some reports in the README. Write to me if you have any doubts.
Source: https://stackoverflow.com/questions/35242155/kneser-ney-smoothing-of-trigrams-using-python-nltk