Using document length in the Naive Bayes Classifier of NLTK Python

Submitted by 北城以北 on 2019-12-06 11:55:35

Question


I am building a spam filter using NLTK in Python. I currently check for the occurrences of words and use the NaiveBayesClassifier, which gives an accuracy of 0.98 and an F measure of 0.92 for spam and 0.98 for non-spam. However, when checking the documents on which my program errs, I notice that a lot of the spam classified as non-spam consists of very short messages.

So I want to add the length of a document as a feature for the NaiveBayesClassifier. The problem is that it only handles binary values. Is there any way to do this other than, for example, saying length<100 = true/false?

(P.S. I have built the spam detector analogously to the example at http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)


Answer 1:


NLTK's implementation of Naive Bayes doesn't do that, but you could combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method will give you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len) -- the probability of a class given the words in the document and its length. If we make a few more independence assumptions, we get:

P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
              = (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
              = [(P(doc|cl) * P(cl)) / P(doc)] * [P(len|cl) / P(len)]
              = P(cl|doc) * P(len|cl) / P(len)

You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).

You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).

Here's one way of going about estimating P(len):

from math import log

from nltk.corpus import movie_reviews
import numpy as np
from scipy import stats

# Log-length of every document in the corpus
loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]

# Fit a normal distribution to the log-lengths
mu = np.mean(loglens)
sd = np.std(loglens)
p = stats.norm(mu, sd)

The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:

p.cdf(log(L+1)) - p.cdf(log(L))
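
For example, with the p fitted above, the estimated probability that a document is exactly 200 words long would be:

prob_200 = p.cdf(log(201)) - p.cdf(log(200))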

The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
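
To make the combination step concrete, here is a minimal sketch. The names classifier (a trained nltk NaiveBayesClassifier) and documents (a list of (words, label) pairs, with featuresets built as in the NLTK book chapter) are assumptions for illustration, not part of NLTK:

from collections import defaultdict
from math import log

import numpy as np
from scipy import stats

# Assumed inputs: `classifier` is a trained nltk.NaiveBayesClassifier,
# `documents` is a list of (words, label) pairs.

# Fit one normal distribution over log-lengths per class, i.e. P(len|cl)
loglens_by_class = defaultdict(list)
for words, label in documents:
    loglens_by_class[label].append(log(len(words)))
p_len_given_cl = {label: stats.norm(np.mean(lls), np.std(lls))
                  for label, lls in loglens_by_class.items()}

def prob_classify_with_length(featureset, length):
    # P(cl|doc) from the trained word-based classifier
    dist = classifier.prob_classify(featureset)
    scores = {}
    for label in dist.samples():
        d = p_len_given_cl[label]
        # P(len|cl): mass the continuous model assigns to the integer length
        p_len_cl = d.cdf(log(length + 1)) - d.cdf(log(length))
        scores[label] = dist.prob(label) * p_len_cl
    # P(len) is identical for every class, so renormalizing over classes
    # has the same effect as dividing by it
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

Since P(len) is constant across classes, dividing by it and then normalizing gives the same result as normalizing P(cl|doc) * P(len|cl) directly, so the overall length model p is not strictly needed for classification.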




Answer 2:


There are multinomial Naive Bayes algorithms that can handle ranged values, but they are not implemented in NLTK. For NLTK's NaiveBayesClassifier, you could try having a couple of different length thresholds as binary features, as sketched below. I'd also suggest trying a Maxent classifier to see how it handles shorter texts.
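
As a rough illustration of threshold features (the cut-offs below are arbitrary placeholders that would need tuning on your data):

def length_features(words):
    # Encode document length as a handful of binary threshold features
    n = len(words)
    return {
        'len<50': n < 50,
        'len<100': n < 100,
        'len<500': n < 500,
    }

These can be merged into the word-occurrence featureset before training, e.g. features.update(length_features(words)).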



Source: https://stackoverflow.com/questions/5248100/using-document-length-in-the-naive-bayes-classifier-of-nltk-python
