What is wrong with this Python function from “Programming Collective Intelligence”?

This is the function in question. It calculates the Pearson correlation coefficient for p1 and p2, which is supposed to be a number between -1 and 1.

When I use this with real user data, it sometimes returns a number greater than 1. Here is the function, followed by a small example and its output:

from math import sqrt

def sim_pearson(prefs,p1,p2):
    si={}
    for item in prefs[p1]: 
        if item in prefs[p2]: si[item]=1

    if len(si)==0: return 0

    n=len(si)

    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])

    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si]) 

    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))

    if den==0: return 0

    r=num/den

    return r

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },

    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}

print sim_pearson(critics, 'user1', 'user2')

1.15470053838

It looks like you may be unexpectedly using integer division. I made the following change and your function returned 1.0:

num=pSum-(1.0*sum1*sum2/n)
den=sqrt((sum1Sq-1.0*pow(sum1,2)/n)*(sum2Sq-1.0*pow(sum2,2)/n))

See PEP 238 for more information on the division operator in Python. An alternative fix is to add this import at the top of the module, which makes / perform true division:

from __future__ import division
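To see where the 1.1547 comes from, the arithmetic for the question's data can be traced by hand. The sketch below assumes Python 3 and uses `//` to stand in for Python 2's integer `/`. The variable names mirror the question's function; the `_int` and `_true` suffixes are added here only for illustration.

```python
from math import sqrt

# Sums for the question's data: user1 = (3, 5, 5), user2 = (4, 5, 5)
n = 3
sum1, sum2 = 13, 14
sum1Sq, sum2Sq = 59, 66
pSum = 62

# Python 2 behaviour: "/" on two ints floors, emulated here with "//"
num_int = pSum - (sum1 * sum2 // n)  # 62 - 60 = 2
den_int = sqrt((sum1Sq - sum1 ** 2 // n) * (sum2Sq - sum2 ** 2 // n))  # sqrt(3 * 1)
r_int = num_int / den_int

# True division, as in Python 3 or with "from __future__ import division"
num_true = pSum - (sum1 * sum2 / n)
den_true = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
r_true = num_true / den_true

print(r_int)   # about 1.1547
print(r_true)  # about 1.0
```

Flooring the three quotients shifts the numerator and the denominator by different amounts, which is why the ratio can leave the [-1, 1] range.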

It took me a minute to read over the code, but it seems that if you change your input data to floats, it will work.

Integer division is confusing it. It works if you make n a float:

n=float(len(si))
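For reference, here is a minimal sketch of the whole function with that one-line change applied. It is a lightly tidied rewrite rather than the book's exact code, and it should behave the same under Python 2 and Python 3 because the float `n` forces true division wherever `n` is the divisor.

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Items rated by both users
    si = [item for item in prefs[p1] if item in prefs[p2]]
    if not si:
        return 0

    # A float n makes every "/ n" below a true division
    n = float(len(si))

    sum1 = sum(prefs[p1][it] for it in si)
    sum2 = sum(prefs[p2][it] for it in si)
    sum1Sq = sum(prefs[p1][it] ** 2 for it in si)
    sum2Sq = sum(prefs[p2][it] ** 2 for it in si)
    pSum = sum(prefs[p1][it] * prefs[p2][it] for it in si)

    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return num / den

critics = {
    'user1': {'item1': 3, 'item2': 5, 'item3': 5},
    'user2': {'item1': 4, 'item2': 5, 'item3': 5},
}

print(sim_pearson(critics, 'user1', 'user2'))  # close to 1.0, up to float rounding
```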

Well, I wasn't exactly able to find what's wrong with the logic in your function, so I just reimplemented it using the definition of Pearson coefficient:

from math import sqrt

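def sim_pearson(p1,p2):
    # Only items rated by both users; a union would raise KeyError
    # whenever one user has not rated an item.
    keys = set(p1) & set(p2)
    n = len(keys)
    if n == 0: return 0

    # Means (needs true division: Python 3, or from __future__ import division)
    a1 = sum(p1[it] for it in keys) / n
    a2 = sum(p2[it] for it in keys) / n

    # Sums of squared deviations from the mean
    sum1Sq = sum((p1[it] - a1) ** 2 for it in keys)
    sum2Sq = sum((p2[it] - a2) ** 2 for it in keys)

    # Sum of products of deviations, over the product of their norms
    num = sum((p1[it] - a1) * (p2[it] - a2) for it in keys)
    den = sqrt(sum1Sq * sum2Sq)
    if den == 0: return 0

    return num / den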

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },

    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}

assert 0.999 < sim_pearson(critics['user1'], critics['user1']) < 1.0001

print('Your example:', sim_pearson(critics['user1'], critics['user2']))
print('Another example:', sim_pearson({1: 1, 2: 2, 3: 3}, {1: 4, 2: 0, 3: 1}))

Note that in your example the Pearson coefficient is exactly 1.0, since the mean-centered rating vectors (-4/3, 2/3, 2/3) and (-2/3, 1/3, 1/3) are parallel.
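The parallel-vectors claim can be checked with exact arithmetic. The sketch below uses the standard library's `fractions.Fraction` to avoid rounding; the helper name `centered` is hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

def centered(values):
    """Subtract the mean from each value, using exact fractions."""
    mean = Fraction(sum(values), len(values))
    return [v - mean for v in values]

d1 = centered([3, 5, 5])  # user1: (-4/3, 2/3, 2/3)
d2 = centered([4, 5, 5])  # user2: (-2/3, 1/3, 1/3)

# d1 is exactly twice d2, so the two vectors point the same way
is_parallel = all(a == 2 * b for a, b in zip(d1, d2))

# r^2 = (d1 . d2)^2 / (|d1|^2 * |d2|^2), computed without rounding
dot = sum(a * b for a, b in zip(d1, d2))
r_squared = dot ** 2 / (sum(a * a for a in d1) * sum(b * b for b in d2))

print(is_parallel, r_squared)
```

Since the dot product is positive and r squared equals 1, the coefficient itself is 1.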

Source: https://stackoverflow.com/questions/1423525/what-is-wrong-with-this-python-function-from-programming-collective-intelligenc
