I have a bunch of data, generally of the form a, b, c, ..., y
where y = f(a, b, c, ...).
Most of the datasets have three or four variables and contain 10k to 10M records.
Do you know to what power you want to limit your polynomial?
If there is no limit, then you can always get an exact match for N points by fitting a polynomial with N coefficients. To do this, you plug your N points into the equation, yielding N equations in N unknowns (the coefficients), which you can then solve for using either simple high-school algebra or a matrix method.
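To make that concrete, here's a minimal Numpy sketch in one variable; the four points below are made up for illustration:

import numpy as np

# Four made-up points sampled from y = 3 + 4x + 5x^2
pts_x = np.array([0.0, 1.0, 2.0, 3.0])
pts_y = np.array([3.0, 12.0, 31.0, 60.0])

# Each row of the Vandermonde matrix is [x^0, x^1, ..., x^(N-1)],
# so V @ coeffs = y is exactly the "N equations, N unknowns" system
V = np.vander(pts_x, increasing=True)
coeffs = np.linalg.solve(V, pts_y)
print(coeffs)  # ~[3, 4, 5, 0]: the quadratic, plus a zero cubic term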
There's a tool for fitting 1D and 2D curves at zunzun.com, but I don't think it goes beyond two variables. Likewise, Matlab doesn't support fitting in more than two dimensions (as far as I know), and it's certainly not free.
Otherwise, you might be able to find part of your solution in Numerical Recipes.
But as other posters have indicated, you'll probably need at least a basic idea of your function model (which, hopefully, is linear or can be linearized, in which case you'll have a much larger array of solutions at your disposal).
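For example, a single-term power law y = p * a^e linearizes under a log transform, log(y) = log(p) + e*log(a), after which an ordinary linear fit applies (a sum of several power terms, like the model in the question, doesn't linearize this way). A minimal Numpy sketch with made-up constants:

import numpy as np

# Made-up data from a single-term power law y = 2.5 * a^1.7
a = np.linspace(1.0, 10.0, 50)
y = 2.5 * a ** 1.7

# Taking logs gives the straight line log(y) = e*log(a) + log(p),
# which a degree-1 polynomial fit can handle
e, log_p = np.polyfit(np.log(a), np.log(y), 1)
print(e, np.exp(log_p))  # recovers roughly 1.7 and 2.5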
The basics of data fitting involve assuming a general form of a solution, guessing some initial values for constants, and then iterating to minimize the error of the guessed solution to find a specific solution, usually in the least-squares sense.
Look into R or Octave for open source tools. They are both capable of least-squares analysis, with several tutorials just a Google search away.
Edit: Octave code for estimating the coefficients of a 2nd-order polynomial:
x = 0:0.1:10;
y = 5.*x.^2 + 4.*x + 3;
% Add noise to y data
y = y + randn(size(y))*0.1;
% Estimate coefficients of polynomial
p = polyfit(x,y,2)
On my machine, I get:
p =
5.0886 3.9050 2.9577
If you have a guess at the form of f,[*] you need a minimizer to find the optimal parameters. The tools Scottie T suggests would work, as would ROOT, and many others.
If you don't have a clue what form f might take you are in deep trouble indeed.
[*] That is, you know that f = f(x, y, z, w, ...; p1, p2, p3, ...), where the p's are parameters and x, y, ... are the coordinates.
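As a concrete illustration of that approach, here's a rough Scipy sketch; the three-term power model, the synthetic data, and the starting guess are all assumptions for the demo, not the asker's actual f:

import numpy as np
from scipy.optimize import minimize

# Assumed model form: f = p0*a^e0 + p1*b^e1 + p2*c^e2
def model(params, a, b, c):
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2

# Least-squares objective: sum of squared residuals
def sse(params, a, b, c, y):
    return np.sum((model(params, a, b, c) - y) ** 2)

# Synthetic data with made-up true parameters, just for the demo
rng = np.random.default_rng(0)
a, b, c = rng.uniform(1.0, 5.0, size=(3, 200))
y = model([2.0, 1.5, 1.0, 0.5, 3.0, 2.0], a, b, c)

# Non-linear fits are sensitive to the starting guess;
# a poor one may land in a local minimum
res = minimize(sse, x0=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0], args=(a, b, c, y))
print(res.x)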
Short Answer: it isn't so simple. Consider a non-parametric approach on data sub-sets.
There are two main issues you need to decide on: (1) do you actually care about the parameters of the function, i.e. your P1, E1, ..., or would you be okay with just estimating the mean function? (2) Do you really need to estimate the function on all of the data?
The first thing I'll mention is that your specified function is non-linear (in the parameters to be estimated), so ordinary least squares won't work. Let's pretend that you specified a linear function. You'd still have a problem with the 10M values. Linear regression can be performed efficiently using QR factorization, but that is still an O(n * p^2) algorithm, where n is the number of records and p is the number of parameters you are trying to estimate, so every one of your 10M records has to be processed (and usually held in memory). If you want to estimate the non-linear mean function it gets much worse.
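For reference, the QR route for a linear model looks something like this (a Numpy sketch with made-up data; numpy.linalg.lstsq is the one-call alternative):

import numpy as np

# Made-up linear problem: y = X @ beta + noise, X is (n x p)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=1000)

# Least squares via QR: factor X = Q R, then solve R @ beta = Q.T @ y
Q, R = np.linalg.qr(X)
beta = np.linalg.solve(R, Q.T @ y)
print(beta)  # close to [2.0, -1.0, 0.5]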
The only way you are going to be able to estimate anything in such a large data set is by using a subset to perform the estimation. Basically, you randomly select a subset and use that to estimate the function.
If you don't care about your parameter values and just want to estimate the mean function, you will probably be better off using a non-parametric estimation technique.
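Here's a minimal sketch of both ideas together, subsampling and then estimating the mean function non-parametrically. It assumes scikit-learn is available; the data, subset size, and neighbor count are placeholders:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Stand-in for the real data: 1M records of (a, b, c) -> y
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(1_000_000, 3))
y = 2.0 * X[:, 0] ** 1.5 + 3.0 * X[:, 1] + X[:, 2] ** 0.5

# Randomly select a subset to keep the estimation tractable
idx = rng.choice(len(X), size=50_000, replace=False)
X_sub, y_sub = X[idx], y[idx]

# Estimate the mean function non-parametrically; k-nearest neighbors
# is just one option (kernel or spline smoothers would also work)
knn = KNeighborsRegressor(n_neighbors=25).fit(X_sub, y_sub)
print(knn.predict(X_sub[:5]))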
Hopefully this helps.
leif
In case it's useful, here's a Numpy/Scipy (Python) template to do what you want:
from numpy import array
from scipy.optimize import leastsq
def __residual(params, y, a, b, c):
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# load a, b, c, y as numpy arrays here
# guess initial values for p0, e0, p1, e1, p2, e2
p_opt, ier = leastsq(__residual, array([p0, e0, p1, e1, p2, e2]), args=(y, a, b, c))
print('y = %f a^%f + %f b^%f + %f c^%f' % tuple(p_opt))
If you really want to understand what's going on, though, you're going to have to invest the time to scale the learning curve for some tool or programming environment - I really don't think there's any way around that. People don't generally write specialized tools for doing things like 3-term power regressions exclusively.