Question
Following this post, I now have serious doubts about whether R-squared or the F-test are good indicators of a good linear fit to some data with random noise. Hence, I want to develop a custom-made regression function so that I can both learn how it works and maybe improve upon the existing tools.
Consider these randomly generated ndarrays x and y:
import numpy as np
np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5
Now I can define the average/mean absolute deviation of any set of data points with:
def aad(X, Y, a, b): # assumes X and Y are of identical shape/size
    n = X.size # highly unsafe!
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))
    return E.sum() / n
which, in my opinion, is the best way to quantify the fitness of a line y = a * x + b to a pair of data arrays. The function simply finds the closest point on the assumed line to each data point and then calculates the perpendicular distance between the point and the line.
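For reference, the perpendicular distance from a point (x0, y0) to the line y = a * x + b also has a well-known closed form, |a * x0 - y0 + b| / sqrt(a**2 + 1). A minimal vectorized sketch of that formula (the function name is mine, just for illustration):
def aad_closed_form(X, Y, a, b):
    # mean perpendicular distance of all (X, Y) points from y = a * x + b,
    # via the standard point-to-line distance formula
    return (np.abs(a * X - Y + b) / np.sqrt(a**2 + 1)).mean()
Comparing its output with aad on the same inputs makes for a quick sanity check of the projection algebra.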
Now I need a function, let's say:
linearFit(X, Y)
which, given the identically shaped ndarrays X and Y, finds the a and b that minimize aad(X, Y, a, b). It is important that the result be a global (absolute) minimum, not just a local one.
Of course, in the spirit of SO's best practices, I have already tried the scipy.optimize functions fmin and brute, as you may see in the above-mentioned post and also here. However, it seems that I can't get my head around the right syntax for those functions. I would appreciate it if you could help me find a canonical and performant implementation for the presumed linearFit function. Thanks for your support in advance.
P.S. A temporary workaround was offered here:
from scipy.optimize import minimize
aad_ = lambda P: aad(P[0], P[1], x1, y1)
minimize(aad_, x0=[X0, Y0])
However, the results I'm getting are not that promising! The solver does not succeed, and I get the message:
Desired error not necessarily achieved due to precision loss
Answer 1:
First of all, thanks to this post, I realized that this is not an ordinary least squares (OLS) regression, as was discussed in the comments above. It actually goes by many names, among them Deming regression, orthogonal distance regression (ODR), and total least squares (TLS). There is, of course, also a Python package for it, scipy.odr! Its syntax is a bit weird and the documentation is not much help, but a good tutorial can be found here.
Next, I found a small bug in the aad definition (its argument order does not match the (a, b)-first order in which the minimize workaround above calls it), so I renamed and fixed it to:
def aaod(a, b, X, Y): # assumes X and Y are of identical shape/size
    n = X.size # still highly unsafe! don't use it in real production
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))
    return E.sum() / n
standing for average absolute orthogonal distance. Now defining our fitting function as:
from scipy.optimize import minimize
from scipy.stats import linregress

def odrFit(X, Y):
    X0 = linregress(X, Y) # wait, this is cheating!
    aaod_ = lambda P: aaod(P[0], P[1], X, Y)
    res = minimize(aaod_, x0=X0[:2], method='Nelder-Mead')
    res_list = res.x.tolist()
    res_list.append(aaod_(res_list))
    return res_list
which is not necessarily the most performant or canonical implementation. I learned the workaround with the temporary lambda function from here, and the method='Nelder-Mead' choice from here.
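As a side note, minimize can also pass the data arrays through its args parameter, which avoids the throwaway lambda. A minimal sketch, assuming a small wrapper (aaod_objective is my name, for illustration) that unpacks the parameter vector:
from scipy.optimize import minimize

def aaod_objective(P, X, Y):
    # P packs the line parameters as [a, b]; X and Y arrive via `args`
    return aaod(P[0], P[1], X, Y)

res = minimize(aaod_objective, x0=[1.0, 0.0], args=(x, y), method='Nelder-Mead')
# res.x then holds the fitted [a, b]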
The scipy.odr implementation can also be written as:
from scipy.odr import Model, ODR, RealData

def f(B, x): # linear model: B[0] is the slope, B[1] the intercept
    return B[0] * x + B[1]

linear = Model(f)
mydata = RealData(x, y)
myodr = ODR(mydata, linear, beta0=[1., 2.])
myoutput = myodr.run()
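The fitted [slope, intercept] pair ends up in myoutput.beta, and the output object can also print a diagnostic summary of the run:
myoutput.pprint() # prints beta, its standard errors, covariance, and the halting reason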
Now let's compare the results of our custom-made odrFit() function, scipy.stats.linregress(), and the scipy.odr run:
slope, intercept, r_value, p_value, std_err = linregress(x, y)
print(*odrFit(x, y))
# --> 1.4804181575739097, 0.47304584702448255, 0.6008218016339527
print(slope, intercept, aaod(slope, intercept, x, y))
# --> 1.434483032725671 0.5747705643012724 0.608021569291401
print(*myoutput.beta, aaod(*myoutput.beta, x, y))
# --> 1.5118079563432785 0.23562547897245803 0.6055838996104704
which shows that our function actually achieves a lower average absolute orthogonal distance than both SciPy's least-squares linregress and its scipy.odr routine. This could just be pure luck, though, and more tests need to be done before drawing a reliable conclusion. The complete code can be found here.
Source: https://stackoverflow.com/questions/60386406/custom-made-regression-using-average-absolute-deviation