I'm currently trying to generate a spam filter by analyzing a corpus I've amassed.
I'm using the Wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering to develop my classification code.
I've implemented code to calculate the probability that a message is spam given that it contains a specific word, using the following formula from the wiki:

Pr(S|W) = (Pr(W|S) * Pr(S)) / (Pr(W|S) * Pr(S) + Pr(W|H) * Pr(H))
My PHP code:
public function pSpaminess($word)
{
    $ps  = $this->pContentIsSpam();
    $ph  = $this->pContentIsHam();
    $pws = $this->pWordInSpam($word);
    $pwh = $this->pWordInHam($word);
    $psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
    return $psw;
}
In accordance with the Combining individual probabilities section, I've implemented code to combine the probabilities of all the unique words in a test message to determine spaminess.
From the wiki formula:

p = (p1 * p2 * ... * pN) / (p1 * p2 * ... * pN + (1 - p1) * (1 - p2) * ... * (1 - pN))
My PHP code:
public function predict($content)
{
    $words = $this->tokenize($content);
    $pProducts = 1; // running product of p_i
    $pSums = 1;     // running product of (1 - p_i), despite the name
    foreach ($words as $word) {
        $p = $this->pSpaminess($word);
        echo "$word: $p\n";
        $pProducts *= $p;
        $pSums *= (1 - $p);
    }
    return $pProducts / ($pProducts + $pSums);
}
On a test string "This isn't very bad at all.", the following output is produced:
C:\projects\bayes>php test.php
this: 0.19907407407407
isn't: 0.23
very: 0.2
bad: 0.2906976744186
at: 0.17427385892116
all: 0.16098484848485
probability message is spam: float(0.00030795502523944)
Here's my question: Am I implementing the combining individual probabilities correctly? Assuming I'm generating valid individual word probabilities, is the combination method correct?
My concern is the really small resultant probability of the calculation. I've tested it on a larger test message and ended up with a probability in scientific notation with more than ten zeroes after the decimal point. I was expecting values in the tenths or hundredths place.
I'm hoping the problem lies in my PHP implementation, but when I examine the combination function from Wikipedia, the formula's numerator is a product of fractions, so I don't see how combining multiple probabilities could ever yield even as much as 0.1%.
If it is the case that the longer the message, the lower the probability score, how do I compensate the spaminess threshold to correctly predict spam/ham for both small and large test cases?
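For what it's worth, the shrinking values are expected: when every p_i is below 0.5, the product of the p_i shrinks faster than the product of the (1 - p_i), so the combined score falls toward zero as the message grows. The usual fixes are to decide spam/ham by whether the score lands above or below 0.5 rather than expecting mid-range values, and to do the combination in log space so long messages cannot underflow. A minimal log-space sketch in Python (not the asker's PHP; it assumes the same per-word probabilities as input):

```python
import math

def combine(probabilities):
    # eta = sum of ln(1 - p_i) - ln(p_i); combined p = 1 / (1 + e^eta).
    # Algebraically identical to products / (products + sums), but the
    # running quantity is a sum of logs, so it never underflows to 0.
    # Note: p_i of exactly 0 or 1 must be smoothed/clamped first,
    # or math.log will blow up.
    eta = sum(math.log(1.0 - p) - math.log(p) for p in probabilities)
    return 1.0 / (1.0 + math.exp(eta))
```

For the six word scores printed above this returns the same 0.00030795502523944 (to floating-point accuracy), but unlike the raw-product version it stays usable on arbitrarily long messages.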
Additional Info
My corpus is actually a collection of about 40k Reddit comments, and I'm applying my "spam filter" against those comments. I rate an individual comment as spam/ham based on the number of down votes versus up votes: if a comment's up votes are fewer than its down votes, it is considered Ham, otherwise Spam.
Now, because of the corpus type, it turns out there are actually few words that are used more in spam than in ham. For example, here is a top-ten list of words that appear in spam more often than in ham:
+-----------+------------+-----------+
| word      | spam_count | ham_count |
+-----------+------------+-----------+
| krugman   |         30 |        27 |
| fetus     |       12.5 |       7.5 |
| boehner   |         12 |        10 |
| hatred    |       11.5 |       5.5 |
| scum      |         11 |        10 |
| reserve   |         11 |        10 |
| incapable |        8.5 |       6.5 |
| socalled  |        8.5 |       5.5 |
| jones     |        8.5 |       7.5 |
| orgasms   |        8.5 |       7.5 |
+-----------+------------+-----------+
On the contrary, most words are used in great abundance in ham more so than in spam. Take, for instance, my top-ten list of words with the highest spam counts:
+------+------------+-----------+
| word | spam_count | ham_count |
+------+------------+-----------+
| the  |       4884 |     17982 |
| to   |     4006.5 |   14658.5 |
| a    |     3770.5 |   14057.5 |
| of   |     3250.5 |   12102.5 |
| and  |       3130 |     11709 |
| is   |     3102.5 |   11032.5 |
| i    |     2987.5 |   10565.5 |
| that |     2953.5 |   10725.5 |
| it   |       2633 |      9639 |
| in   |     2593.5 |    9780.5 |
+------+------------+-----------+
As you can see, frequency of spam usage is significantly less than ham usage. In my corpus of 40k comments 2100 comments are considered spam.
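One thing worth noticing: if pWordInSpam is count / class-total and pContentIsSpam is class-total / corpus-total, the class totals cancel in pSpaminess, leaving just the raw count ratio: (count_S/T_S * T_S/T) / (count_S/T_S * T_S/T + count_H/T_H * T_H/T) = count_S / (count_S + count_H). A quick Python check of that algebra (a hypothetical helper, not the asker's code):

```python
# When the likelihood normalizers and the priors use the same class totals,
# Pr(S|W) collapses to the raw count ratio.
def p_word(spam_count, ham_count):
    return spam_count / (spam_count + ham_count)
```

For example, p_word(3770.5, 14057.5) ≈ 0.21149, which is exactly the score printed for "a" in the analysis further down. So nearly every common word lands near 0.21, and very little can pull a message far above 0.5.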
As suggested below, a test phrase on a post considered spam rates as follows:
Phrase
Cops are losers in general. That's why they're cops.
Analysis:
C:\projects\bayes>php test.php
cops: 0.15833333333333
are: 0.2218958611482
losers: 0.44444444444444
in: 0.20959269435914
general: 0.19565217391304
that's: 0.22080730418068
why: 0.24539170506912
they're: 0.19264544456641
float(6.0865969793861E-5)
According to this, there is an extremely low probability that this is spam. However, if I were to now analyze a ham comment:
Phrase
Bill and TED's excellent venture?
Analysis
C:\projects\bayes>php test.php
bill: 0.19534050179211
and: 0.21093065570456
ted's: 1
excellent: 0.16091954022989
venture: 0.30434782608696
float(1)
Okay, this is interesting. I'm doing these examples as I compose this update, so this is the first time I've seen the result for this specific test case. I think my prediction is inverted: it's actually picking out the probability of Ham instead of Spam. This deserves validation.
New test on known ham.
Phrase
Complain about $174,000 salary being too little for self. Complain about $50,000 a year too much for teachers. Scumbag congressman.
Analysis
C:\projects\bayes>php test.php
complain: 0.19736842105263
about: 0.21896031561847
174: 0.044117647058824
000: 0.19665809768638
salary: 0.20786516853933
being: 0.22011494252874
too: 0.21003236245955
little: 0.21134020618557
for: 0.20980452359022
self: 0.21052631578947
50: 0.19245283018868
a: 0.21149315683195
year: 0.21035386631717
much: 0.20139771283355
teachers: 0.21969696969697
scumbag: 0.22727272727273
congressman: 0.27678571428571
float(3.9604152477223E-11)
Unfortunately, no. It turns out that was a coincidental result. I'm starting to wonder if perhaps comments can't be so easily quantified. Perhaps the nature of a bad comment is too different from the nature of a spam message.
Perhaps spam filtering only works when your spam messages share a specific word class?
Final Update
As pointed out in the replies, the weird results were due to the nature of the corpus. With a comment corpus where there is no explicit definition of spam, Bayesian classification does not perform well. Since it is possible (and likely) that any one comment may receive both spam and ham ratings from various users, it is not possible to generate a hard classification for spam comments.
Ultimately, I wanted to generate a comment classifier that could determine whether a comment post would garner karma, based on a Bayesian classification tuned to comment content. I may still investigate tuning the classifier to email spam messages and see whether such a classifier can guess at the karma response for comment systems. But for now, the question is answered. Thank you all for your input.
Verifying with just a calculator, it seems OK for the non-spam phrase you posted. In that case $pProducts is a couple of orders of magnitude smaller than $pSums.
Try running some real spam from your spam folder, where you'd meet probabilities like 0.8. And guess why spammers sometimes send a piece of a newspaper in a hidden frame along with the message :)
If your filter is not biased (Pr(S)=Pr(H) = 0.5) then: "It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size."
This means that you should teach your Bayesian filter on the similar amount of spam and ham messages. Say 1000 spam messages and 1000 ham messages.
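To put a number on how much that prior matters, here is a hedged Python sketch using the "krugman" counts from the table above, and treating the 2100/37900 comment counts as the class sizes (an assumption, since the real pWordInSpam normalizer isn't shown):

```python
def p_word_given_prior(pws, pwh, ps):
    # Pr(S|W) with an explicit prior Pr(S) = ps and Pr(H) = 1 - ps.
    ph = 1.0 - ps
    return (pws * ps) / (pws * ps + pwh * ph)

pws = 30 / 2100    # Pr("krugman" | spam): 30 spam hits over 2100 spam comments
pwh = 27 / 37900   # Pr("krugman" | ham): 27 ham hits over 37900 ham comments
balanced = p_word_given_prior(pws, pwh, 0.5)           # 50% hypothesis
corpus   = p_word_given_prior(pws, pwh, 2100 / 40000)  # actual corpus prior
```

Under the 50% hypothesis "krugman" scores about 0.95, but with the real 2100/40000 prior it drops to 30/57 ≈ 0.53. That is one way to see why an unbalanced corpus pins every word near the base rate.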
I'd assume (not checked) that if your filter is biased, the learning set should conform to the hypothesis about any message being spam.
On the idea of compensating for message lengths: you could estimate, for each set, the probability that a word of a message is a specific word, then use a Poisson distribution to estimate the probability of a message of N words containing that specific word.
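That Poisson idea can be sketched as follows (Python; the function name and the word_rate parameter are illustrative, with word_rate standing for the per-class estimate of a token being that specific word):

```python
import math

def p_message_contains(word_rate, n_words):
    # Model occurrences of the word in an n_words-long message as
    # Poisson with mean word_rate * n_words; the probability of at
    # least one occurrence is 1 - Pr(zero occurrences) = 1 - e^(-lam).
    lam = word_rate * n_words
    return 1.0 - math.exp(-lam)
```

For example, a word with a per-token rate of 0.01 shows up in about 63% of 100-word messages (1 - e^-1), so the length dependence is modeled explicitly instead of silently shrinking the combined score.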
Source: https://stackoverflow.com/questions/6463983/combining-individual-probabilities-in-naive-bayesian-spam-filtering