Using document length in the Naive Bayes Classifier of NLTK Python

蓝咒 submitted on 2019-12-04 15:45:08

NLTK's implementation of Naive Bayes doesn't do that, but you could combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method will give you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len) -- the probability of a class given the words in the document and its length. If we further assume that the words and the length are independent of each other, both within each class and overall, we get:

P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
              = (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
              = (P(doc|cl) * P(cl)) / P(doc) * P(len|cl) / P(len)
              = P(cl|doc) * P(len|cl) / P(len)

You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).
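As a rough sketch of how the last line of the derivation could be applied: the helper below multiplies each P(cl|doc) by P(len|cl) and renormalizes over the classes. The name rescale_by_length and the plain-dict inputs are assumptions made for illustration, not part of NLTK. Because P(len) is the same for every class, the renormalization absorbs it, and the result stays a proper distribution even if the independence assumptions hold only approximately.

```python
def rescale_by_length(class_probs, len_given_class):
    """Combine P(cl|doc) with P(len|cl) and renormalize over classes.

    class_probs:     dict mapping class label -> P(cl|doc)
    len_given_class: dict mapping class label -> P(len|cl) for this document's length
    """
    unnormalized = {cl: class_probs[cl] * len_given_class[cl] for cl in class_probs}
    total = sum(unnormalized.values())
    if total == 0:
        # No class assigns any mass to this length; fall back to P(cl|doc).
        return dict(class_probs)
    return {cl: score / total for cl, score in unnormalized.items()}


# Made-up numbers: the words slightly favor "pos", the length favors "neg".
posterior = rescale_by_length({"pos": 0.6, "neg": 0.4}, {"pos": 0.001, "neg": 0.003})
```

With NLTK, class_probs could be built from the distribution returned by prob_classify, for example {cl: dist.prob(cl) for cl in dist.samples()}.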

You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).

Here's one way of going about estimating P(len):

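from math import log

import numpy as np
from scipy import stats
from nltk.corpus import movie_reviews  # needs nltk.download('movie_reviews') once

# Log of the token count of every document in the corpus
loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]

# scipy.mean and scipy.var are gone from recent SciPy releases; use NumPy instead
mu = np.mean(loglens)
sd = np.std(loglens)

# Normal distribution over log-lengths
p = stats.norm(mu, sd)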

The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:

p.cdf(log(L+1)) - p.cdf(log(L))
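A self-contained sketch of that calculation follows. It uses the standard library's statistics.NormalDist in place of scipy.stats.norm (a substitution made here so the example has no dependencies). The values of mu and sigma are made up, and the function name length_prob is hypothetical.

```python
from math import log
from statistics import NormalDist


def length_prob(dist, length):
    """Probability mass the log-normal model assigns to an integer length >= 1."""
    if length < 1:
        raise ValueError("length must be at least 1, since log(0) is undefined")
    return dist.cdf(log(length + 1)) - dist.cdf(log(length))


# Made-up parameters: mean and standard deviation of the log-lengths.
p = NormalDist(mu=6.0, sigma=0.5)

# The per-length masses telescope, so over a wide enough range they sum to about 1.
total = sum(length_prob(p, n) for n in range(1, 100001))
```

Taking the difference of two CDF values turns the continuous density into a probability for each integer length, which is what makes the masses add up to a proper distribution.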

The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
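One way the per-class estimates might look, sketched with made-up token counts rather than a real corpus. The helper names and the sample data are hypothetical, and statistics.NormalDist again stands in for scipy.stats.norm.

```python
from math import log
from statistics import NormalDist


def fit_log_length_model(lengths):
    """Fit a normal distribution to the logs of the given document lengths."""
    # from_samples needs at least two values and uses the sample standard deviation.
    return NormalDist.from_samples([log(n) for n in lengths])


def length_prob(dist, length):
    """P(len) under the model: mass between log(length) and log(length + 1)."""
    return dist.cdf(log(length + 1)) - dist.cdf(log(length))


# Made-up token counts per class, for illustration only.
lengths_by_class = {
    "short_notes": [40, 55, 62, 48, 70],
    "long_reviews": [600, 750, 820, 690, 910],
}

# One model per class for P(len|cl), plus one over all documents for P(len).
class_models = {cl: fit_log_length_model(ls) for cl, ls in lengths_by_class.items()}
overall_model = fit_log_length_model(
    [n for ls in lengths_by_class.values() for n in ls]
)

# P(len|cl) / P(len) for a 60-token document, per class.
ratios = {
    cl: length_prob(model, 60) / length_prob(overall_model, 60)
    for cl, model in class_models.items()
}
```

A ratio above 1 means the length makes that class more likely than the words alone suggest; a ratio below 1 means less likely. Far out in the tails the CDF difference can underflow to zero, so some smoothing may be worth considering on real data.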

There are multinomial Naive Bayes algorithms that can handle range values, but they are not implemented in NLTK. With NLTK's NaiveBayesClassifier, you could try adding a couple of different length thresholds as binary features. I'd also suggest trying a Maxent classifier to see how it handles shorter texts.
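The length-threshold idea could be sketched as a feature extractor like the one below. The function names, the feature-name format, and the thresholds (50, 200 and 1000 tokens) are all arbitrary placeholders chosen for illustration, and would need tuning for a real corpus.

```python
def length_features(tokens, thresholds=(50, 200, 1000)):
    """Return one boolean feature per threshold: is the document longer than it?"""
    n = len(tokens)
    return {"len>%d" % t: n > t for t in thresholds}


def document_features(tokens):
    """Bag-of-words presence features plus the length-threshold features."""
    features = {"contains(%s)" % w.lower(): True for w in tokens}
    features.update(length_features(tokens))
    return features


example = document_features("a short text about a film".split())
```

Feature dicts of this shape can be paired with labels and passed to nltk.NaiveBayesClassifier.train, which expects a list of (featureset, label) tuples.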
