For one of my course projects I started implementing a "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially spam) using huge
Here's a trick:
for the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n),
then we have:
p = S / (S + H)
p = 1 / ((S + H) / S)
p = 1 / (1 + H / S)
let's expand again:
p = 1 / (1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)
So basically, you will obtain a product of quite large numbers (each factor lies between 0 and, for p_i = 0.01, 99). The idea is not to multiply tons of small numbers with one another, to obtain, well, 0, but to make a quotient of two small numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the above method will give you 0/(0+0), which is NaN, whereas the proposed method will give you 1/(1 + 1*...*1), which is 0.5.
You can get even better results when all p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n); then the following formula gives even better precision:
p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)
That way you divide big numerators (from the small p_i) by big denominators (the big p_(n+1-i)), and small numerators by small denominators.
edit: MSalter proposed a useful further optimization in his answer. Using it, the formula reads as follows:
p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)