I wrote a method to calculate the cosine distance between two arrays:
def cosine_distance(a, b):
if len(a) != len(b):
return False
numerator = 0
def cd(a,b):
if(len(a)!=len(b)):
raise ValueError, "a and b must be the same length"
rn = range(len(a))
adb = sum([a[k]*b[k] for k in rn])
nma = sqrt(sum([a[k]*a[k] for k in rn]))
nmb = sqrt(sum([b[k]*b[k] for k in rn]))
result = 1 - adb / (nma*nmb)
return result
(I originally thought) you're not going to speed it up a lot without breaking out to C (like numpy or scipy) or changing what you compute. But here's how I'd try that, anyway:
from itertools import imap
from math import sqrt
from operator import mul
def cosine_distance(a, b):
assert len(a) == len(b)
return 1 - (sum(imap(mul, a, b))
/ sqrt(sum(imap(mul, a, a))
* sum(imap(mul, b, b))))
It's roughly twice as fast in Python 2.6 with 500k-element arrays. (After changing map to imap, following Jarret Hardie.)
Here's a tweaked version of the original poster's revised code:
from itertools import izip
def cosine_distance(a, b):
assert len(a) == len(b)
ab_sum, a_sum, b_sum = 0, 0, 0
for ai, bi in izip(a, b):
ab_sum += ai * bi
a_sum += ai * ai
b_sum += bi * bi
return 1 - ab_sum / sqrt(a_sum * b_sum)
It's ugly, but it does come out faster. . .
Edit: And try Psyco! It speeds up the final version by another factor of 4. How could I forget?