This function is from the book "Programming Collective Intelligence". It is supposed to calculate the Pearson correlation coefficient for p1 and p2, which should be a number between -1 and 1.
If two critics rate items very similarly the function should return 1, or close to 1.
With real user data I sometimes get weird results. In the following example, the dataset critics2 should return 1, but it returns 0 instead.
Does anyone spot a mistake?
(This is not a duplicate of What is wrong with this python function from “Programming Collective Intelligence”)
    from __future__ import division
    from math import sqrt

    def sim_pearson(prefs, p1, p2):
        # Collect the items rated by both users
        si = {}
        for item in prefs[p1]:
            if item in prefs[p2]:
                si[item] = 1
        if len(si) == 0:
            return 0
        n = len(si)
        sum1 = sum([prefs[p1][it] for it in si])
        sum2 = sum([prefs[p2][it] for it in si])
        sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
        sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
        pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
        num = pSum - (sum1 * sum2 / n)
        den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
        if den == 0:
            return 0
        r = num / den
        return r
    critics = {
        'user1': {
            'item1': 3,
            'item2': 5,
            'item3': 5,
        },
        'user2': {
            'item1': 4,
            'item2': 5,
            'item3': 5,
        }
    }

    critics2 = {
        'user1': {
            'item1': 5,
            'item2': 5,
            'item3': 5,
        },
        'user2': {
            'item1': 5,
            'item2': 5,
            'item3': 5,
        }
    }

    critics3 = {
        'user1': {
            'item1': 1,
            'item2': 3,
            'item3': 5,
        },
        'user2': {
            'item1': 5,
            'item2': 3,
            'item3': 1,
        }
    }
    print(sim_pearson(critics, 'user1', 'user2'))
    # result: 1.0 (expected)

    print(sim_pearson(critics2, 'user1', 'user2'))
    # result: 0 (unexpected)

    print(sim_pearson(critics3, 'user1', 'user2'))
    # result: -1 (expected)
There is nothing wrong with your result. You are trying to fit a line through three points. In the second case, all three points have the same coordinates, so effectively you have a single point. You can't say whether these points correlate or anti-correlate, because infinitely many lines pass through one point (den in your code equals zero).
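To make this concrete, here is a minimal sketch that recomputes the two terms under the square root for the critics2 ratings. The variable names (x, y, ss_x, ss_y) are chosen here for illustration and do not appear in the book's code. Both terms come out as zero, so den is zero and the function takes its `return 0` branch.

```python
from math import sqrt

# Ratings from critics2: both users gave every shared item a 5.
x = [5, 5, 5]
y = [5, 5, 5]
n = len(x)

# Same terms as in the question's code: sum of squares minus (sum squared / n).
# Each one is n times the variance of the series.
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n

den = sqrt(ss_x * ss_y)
print(ss_x, ss_y, den)  # all zero, so r = num / den would be 0 / 0
```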
If you look up the Pearson correlation on Wikipedia, you'll see that the formula uses the difference between each item in a series and the mean of the series. When all the items in a series are the same, every one of those differences is zero, so the denominator is zero and the calculation fails with a division by zero.
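For reference, this is the standard definition of the sample Pearson coefficient. The symbols are chosen here for illustration: $x_i$ and $y_i$ are the two users' ratings of the $n$ shared items, and $\bar{x}$, $\bar{y}$ are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

If every $x_i$ equals $\bar{x}$ (as in critics2), both the numerator and the denominator are zero, so $r$ is the undefined expression $0/0$.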
If it is any clearer, you can use this code:

    import math

    def simplified_sim_pearson(p1, p2):
        # p1 and p2 are equal-length lists of ratings for the same items
        n = len(p1)
        assert n != 0
        sum1 = sum(p1)
        sum2 = sum(p2)
        m1 = float(sum1) / n
        m2 = float(sum2) / n
        # Deviations of each rating from its series mean
        p1mean = [(x - m1) for x in p1]
        p2mean = [(y - m2) for y in p2]
        numerator = sum(x * y for x, y in zip(p1mean, p2mean))
        denominator = math.sqrt(sum(x * x for x in p1mean) * sum(y * y for y in p2mean))
        # A constant series gives a zero denominator; fall back to 0
        return numerator / denominator if denominator else 0

    def sim_pearson(prefs, p1, p2):
        p1 = prefs[p1]
        p2 = prefs[p2]
        # Items rated by both users, in a fixed order
        si = set(p1.keys()).intersection(set(p2.keys()))
        p1_x = [p1[k] for k in sorted(si)]
        p2_x = [p2[k] for k in sorted(si)]
        return simplified_sim_pearson(p1_x, p2_x)
    critics = {
        'user1': {
            'item1': 3,
            'item2': 5,
            'item3': 5,
        },
        'user2': {
            'item1': 4,
            'item2': 5,
            'item3': 5,
        }
    }

    critics2 = {
        'user1': {
            'item1': 5,
            'item2': 5,
            'item3': 5,
        },
        'user2': {
            'item1': 5,
            'item2': 5,
            'item3': 5,
        }
    }

    critics3 = {
        'user1': {
            'item1': 1,
            'item2': 3,
            'item3': 5,
        },
        'user2': {
            'item1': 5,
            'item2': 3,
            'item3': 1,
        }
    }
    print(sim_pearson(critics, 'user1', 'user2'))
    print(sim_pearson(critics2, 'user1', 'user2'))
    print(sim_pearson(critics3, 'user1', 'user2'))
By the way, using Excel to determine the correct answer is a good way to validate most calculations. In this case, you would have used the CORREL function.
The algorithm gives the correct result. 0 means that there is no correlation between them (or at least you can't tell from what you know).
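If you need to tell "the ratings are uncorrelated" apart from "the correlation cannot be computed", one option is to return None in the zero-denominator case instead of 0. The sketch below is only an illustration of that idea; the name pearson_or_none is made up here and is not from the book.

```python
from math import sqrt


def pearson_or_none(xs, ys):
    """Return Pearson's r for two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n == 0 or n != len(ys):
        return None
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    den = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if den == 0:
        # At least one series is constant, so r would be 0 / 0.
        return None
    return sum(a * b for a, b in zip(dx, dy)) / den


print(pearson_or_none([5, 5, 5], [5, 5, 5]))         # None: undefined (critics2)
print(pearson_or_none([1, 3, 5], [5, 3, 1]))         # perfectly anti-correlated (critics3)
print(pearson_or_none([1, 2, 3, 4], [1, -1, -1, 1])) # a genuine zero correlation
```

With this variant, the caller decides how to treat an undefined similarity (skip the pair, treat it as 0, and so on) rather than having that choice hidden inside the function.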
Generally (depending on the domain you apply this algorithm to), you can consider everything in the range -0.09 < x < 0.09 as "no correlation observable".
Correlation does not imply causation. Had to say it. You need to develop an understanding of correlation statistics. A correlation can be anywhere between -1 and 1; a value of 0 falls in this range and is a perfectly reasonable result. A correlation of 0 means that there is no linear relationship between the two variables. Remember to avoid doing statistics with fewer than 30 samples.
Source: https://stackoverflow.com/questions/1778411/what-is-wrong-with-the-pearson-algorithm-from-programming-collective-intelligen