Combining individual probabilities in Naive Bayesian spam filtering

Submitted by 放荡痞女 on 2019-12-02 20:36:08

Verifying with only a calculator, it seems correct for the non-spam phrase you posted. In that case $pProducts is a couple of orders of magnitude smaller than $pSums.
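As a sketch of what those two quantities usually are in this scheme (assuming the standard naive Bayes combination of per-word spam probabilities; the names `pProducts` and `pSums` here mirror the variables mentioned above but the implementation is my guess, not your actual code):

```python
import math

def spamicity(word_probs):
    """Combine per-word P(spam | word) values into one message score.

    Computes p1*p2*...*pn / (p1*...*pn + (1-p1)*...*(1-pn)),
    done in log space so long messages don't underflow to 0.
    """
    log_p = sum(math.log(p) for p in word_probs)      # log of the "pProducts" term
    log_q = sum(math.log(1 - p) for p in word_probs)  # log of the complementary product
    return 1 / (1 + math.exp(log_q - log_p))

print(spamicity([0.8, 0.9, 0.7]))  # spammy words dominate -> score near 1
print(spamicity([0.1, 0.2, 0.3]))  # hammy words dominate  -> score near 0
```

With spammy probabilities like 0.8 the numerator dominates, which is exactly the behavior you should see once you feed real spam through the filter.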

Try running it on some real spam from your spam folder, where you'll meet word probabilities like 0.8. And guess why spammers sometimes send a piece of newspaper text in a hidden frame along with the message :)

If your filter is not biased (Pr(S)=Pr(H) = 0.5) then: "It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size."

This means that you should teach your Bayesian filter on the similar amount of spam and ham messages. Say 1000 spam messages and 1000 ham messages.

I'd assume (not checked) that if your filter is biased, the learning set should conform to its hypothesis about the prior probability of any message being spam.

On the idea of compensating for message lengths: you could estimate, for each set, the probability that a word of a message is a specific word, then use a Poisson distribution to estimate the probability of a message of N words containing that specific word.
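A minimal sketch of that length-compensation idea (the rate value and message length below are made-up illustrations, not numbers from your data):

```python
import math

def p_word_in_message(word_rate, n_words):
    """Probability that a message of n_words contains a given word at least once.

    Assumes word occurrences follow a Poisson distribution whose rate is
    word_rate (the word's per-word frequency in the training set) times the
    message length, so longer messages get a proportionally higher chance.
    """
    lam = word_rate * n_words   # expected number of occurrences in this message
    return 1 - math.exp(-lam)   # P(X >= 1) for X ~ Poisson(lam)

# e.g. a word making up 0.1% of the spam corpus, in a 200-word message:
print(p_word_in_message(0.001, 200))
```

These length-adjusted probabilities could then replace the raw per-word frequencies when comparing a message against the spam and ham sets.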
