You are looking for NumPy (matrix manipulation and number-crunching) and SciPy (optimization).
To get started, see https://stackoverflow.com/questions/4375094/numpy-learning-resources
I worked out the given example as follows:
- I opened the sample Excel files in OpenOffice
- I copied the team data (without headers) to a new sheet and saved it as teams.csv
- I copied the game data (without headers) to a new sheet and saved it as games.csv
Then, in Python:
import csv
import numpy
import scipy.optimize
def readCsvFile(fname):
    with open(fname, 'r') as inf:
        return list(csv.reader(inf))
# Get team data
team = readCsvFile('teams.csv') # list of num,name
numTeams = len(team)
# Get game data
game = readCsvFile('games.csv') # list of game,home,away,homescore,awayscore
numGames = len(game)
# Now, we have the NFL teams for 2002 and data on all games played.
# From this, we wish to forecast the score of future games.
# We are going to assume that each team has an inherent performance-factor,
# and that there is a bonus for home-field advantage; then the
# relative final score between a home team and an away team can be
# calculated as (home advantage) + (home team factor) - (away team factor)
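```python
# (Quick numeric check of the model with made-up numbers: a 2-point home
# advantage, a home-team factor of 5, and an away-team factor of 1 predict
# a final margin of 2 + 5 - 1 = 6 points in the home team's favour.)
assert 2.0 + 5.0 - 1.0 == 6.0
```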
# First we create a matrix M which will hold the data on
# who played whom in each game and who had home-field advantage.
m_rows = numTeams + 1
m_cols = numGames
M = numpy.zeros( (m_rows, m_cols) )
# Then we create a vector S which will hold the final
# relative scores for each game.
s_cols = numGames
S = numpy.zeros(s_cols)
# Loading M and S with game data
for col,gamedata in enumerate(game):
    gameNum,home,away,homescore,awayscore = gamedata
    # In the csv data, teams are numbered starting at 1,
    # so we let home-field advantage be 'team 0' in our matrix
    M[0, col] = 1.0  # home-field advantage
    M[int(home), col] = 1.0
    M[int(away), col] = -1.0
    S[col] = int(homescore) - int(awayscore)
# Now, if our theoretical model is correct, we should be able
# to find a performance-factor vector W such that W*M == S
#
# In the real world, we will never find a perfect match,
# so instead we look for the W whose predictions S'
# minimize the sum of squared differences between S and S'.
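```python
# (Aside: a self-contained toy of what scipy.optimize.leastsq does for us.
# Given a residual function, it finds the w minimizing the sum of squared
# residuals; for this small consistent system it recovers w exactly.
# The toy_* names are illustrative only; imports are repeated so the
# aside runs standalone.)
import numpy
import scipy.optimize

toy_m = numpy.array([[1.0, 1.0, 0.0],
                     [0.0, 1.0, 1.0]])  # columns encode w0, w0+w1, w1
toy_s = numpy.array([3.0, 5.0, 2.0])
toy_w, _ = scipy.optimize.leastsq(lambda w: w.dot(toy_m) - toy_s,
                                  numpy.zeros(2))
# toy_w is close to [3.0, 2.0]
```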
# Initial guess at team weightings:
# 2.0 points home-team advantage, and all teams equally strong
init_W = numpy.array([2.0]+[0.0]*numTeams)
def errorfn(w,m,s):
    return w.dot(m) - s
W = scipy.optimize.leastsq(errorfn, init_W, args=(M,S))
homeAdvantage = W[0][0] # 2.2460937500005356
teamStrength = W[0][1:] # numpy.array([-151.31111318, -136.36319652, ... ])
# Team strengths are only meaningful relative to one another;
# adding any constant to all of them leaves every predicted
# margin unchanged.
# To make them easier to interpret, we shift them so that
# the average is 0.0
teamStrength -= teamStrength.mean()
for t,s in zip(team,teamStrength):
    print("{0:>10}: {1: .7}".format(t[1], s))
This results in:
Ari: -9.8897569
Atl: 5.0581597
Balt: -2.1178819
Buff: -0.27413194
Carolina: -3.2720486
Chic: -5.2654514
Cinn: -10.503646
Clev: 1.2338542
Dall: -8.4779514
Den: 4.8901042
Det: -9.1727431
GB: 3.5800347
Hous: -9.4390625
Indy: 1.1689236
Jack: -0.2015625
KC: 6.1112847
Miami: 6.0588542
Minn: -3.0092014
NE: 4.0262153
NO: 2.4251736
NYG: 0.82725694
NYJ: 3.1689236
Oak: 10.635243
Phil: 8.2987847
Pitt: 2.6994792
St. Louis: -3.3352431
San Diego: -0.72065972
SF: 0.63524306
Seattle: -1.2512153
Tampa: 8.8019097
Tenn: 1.7640625
Wash: -4.4529514
which is the same result shown in the spreadsheet.
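With the fitted weights in hand, forecasting a future game (the original goal) is just the model formula run forward: predicted margin = (home advantage) + (home team factor) - (away team factor). A minimal sketch, using the fitted 2002 values quoted above; the Oakland-at-home-versus-Tennessee pairing is purely an illustrative choice:

```python
# Forecast sketch: the home-advantage and team-strength numbers are the
# fitted 2002 values from the output above; the matchup is hypothetical.
homeAdvantage = 2.2460938
strength = {'Oak': 10.635243, 'Tenn': 1.7640625}

def predictMargin(home, away):
    """Predicted (home score - away score) under the fitted model."""
    return homeAdvantage + strength[home] - strength[away]

margin = predictMargin('Oak', 'Tenn')
print(round(margin, 2))  # roughly an 11-point edge for Oakland at home
```

A positive margin predicts a home win; swapping the arguments shows how much the forecast shifts when the venue changes.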