Question
I am comparing the numerical results of C++ and Python computations. In C++, I make use of LAPACK's sgels function to compute the coefficients of a linear regression problem. In Python, I use Numpy's linalg.lstsq function for a similar task.
What is the mathematical difference between the methods used by sgels and linalg.lstsq?
What is the expected error (e.g. 6 significant digits) when comparing the results (i.e. the regression coefficients) numerically?
FYI: I am by no means a C++ or Python expert, which makes it difficult to understand what is going on inside the functions.
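For reference, the Python side of such a comparison boils down to a single call; a minimal sketch with made-up data (the actual A and b are not shown in the question):

```python
import numpy as np

# Toy data for illustration only; the question does not show the actual
# design matrix A or the observations b.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(100)

# NumPy side of the comparison: regression coefficients via least squares.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)
print(coeffs)
```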
Answer 1:
Taking a look at the source of numpy, in the file linalg.py, lstsq relies on LAPACK's zgelsd() for complex input and dgelsd() for real input. Here are the differences to sgels():

- dgelsd() is for double while sgels() is for float. There is a difference of precision (see the sketch after this list).
- dgels() makes use of the QR factorization of the matrix A and assumes that A has full rank. The condition number of the matrix must be reasonable to get a significant result. See this course for the logic of the method. On the other hand, dgelsd() makes use of the singular value decomposition of A. In particular, A can be rank-deficient, and small singular values are discarded depending on the additional argument rcond or on machine precision. Notice that numpy's default value for rcond is -1: negative values refer to machine precision. See this course for the logic.
- According to the benchmark of LAPACK, one can expect dgels() to be about 5 times faster than dgelsd().
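To isolate the precision part of the difference (single vs. double), one can run numpy's lstsq on float32 and float64 copies of the same made-up data; a rough sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))
x_true = np.array([2.0, -1.0, 0.3, 4.0])
b = A @ x_true + 1e-3 * rng.standard_normal(200)

# float32 inputs are handled by the single-precision LAPACK driver,
# float64 inputs by the double-precision one.
x32, *_ = np.linalg.lstsq(A.astype(np.float32), b.astype(np.float32), rcond=None)
x64, *_ = np.linalg.lstsq(A, b, rcond=None)

# The gap is dominated by single-precision round-off (~1e-7),
# amplified by the conditioning of A.
print(np.max(np.abs(x32 - x64)))
```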
You may see significant differences between the results of sgels() and dgelsd() if the matrix is ill-conditioned. Indeed, there is a bound on the error of the linear regression which depends on the algorithm and on the value of rcond that is used. See the LAPACK Users' Guide sections "Error Bounds for Linear Least Squares Problems" for estimates of the errors and "Further Details: Error Bounds for Linear Least Squares Problems" for technical details.
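As a back-of-the-envelope version of those bounds: the relative error of the coefficients grows roughly with the condition number of A (and with its square when the residual is large), scaled by the working precision. A crude sketch of such an estimate in Python, not LAPACK's exact formula:

```python
import numpy as np

def rough_lstsq_error_estimate(A, b, x, eps):
    """Crude forward-error estimate for the least-squares solution x,
    in the spirit of the LAPACK bounds (not the exact formula from the
    user guide): roughly eps * (cond(A) + cond(A)**2 * relative residual)."""
    cond = np.linalg.cond(A)
    r = b - A @ x
    rel_resid = np.linalg.norm(r) / max(np.linalg.norm(A @ x), np.finfo(float).tiny)
    return eps * (cond + cond**2 * rel_resid)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 3))
b = rng.standard_normal(50)
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Same rough bound evaluated at single precision (sgels) and double precision (dgelsd).
print("float32:", rough_lstsq_error_estimate(A, b, x, np.finfo(np.float32).eps))
print("float64:", rough_lstsq_error_estimate(A, b, x, np.finfo(np.float64).eps))
```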
As a conclusion, sgels() and dgels() can be used if the measurements in b are accurate and easily related to the explanatory variables. For instance, if sensors are placed at the exits of exhaust pipes, it's easy to guess which motors are running. But sometimes the linear link between the sources and the measurements is not precisely known (uncertainty on the terms of A), or discriminating polluters on the basis of the measurements becomes more difficult (some polluters are far from the set of sensors and A is ill-conditioned). In this kind of situation, dgelsd() and tuning the rcond argument can help. Whenever in doubt, use dgelsd() and estimate the error on the computed x according to LAPACK's user guide.
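To illustrate that last point, here is a sketch on made-up, nearly rank-deficient data showing how the rcond cutoff changes what numpy's lstsq returns; the reported rank tells you how many singular values survived the cutoff:

```python
import numpy as np

rng = np.random.default_rng(3)
# Nearly rank-deficient design matrix: the third column is almost a
# copy of the first one, so one singular value is tiny.
A = rng.standard_normal((100, 3))
A[:, 2] = A[:, 0] + 1e-9 * rng.standard_normal(100)
b = rng.standard_normal(100)

for rcond in (None, 1e-12, 1e-6):
    x, residuals, rank, s = np.linalg.lstsq(A, b, rcond=rcond)
    # With rcond=1e-6 the tiny singular value is discarded, the effective
    # rank drops from 3 to 2, and the coefficients stop blowing up.
    print(f"rcond={rcond}: rank={rank}, |x|={np.linalg.norm(x):.3e}")
```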
Source: https://stackoverflow.com/questions/41637108/the-difference-between-c-lapack-sgels-and-python-numpy-lstsq-results