Differences in Linear Regression in R and Python [closed]

tl;dr if you want to replicate R's strategy in Python you're probably going to have to implement it yourself, as R does some clever stuff that's not widely available elsewhere.

For reference (since it has been mentioned so far only in comments), this is an ill-posed/rank-deficient fit, which will always happen when there are more predictor variables than observations (p>n: in this case p=73, n=61), and often when there are many categorical predictors and/or the experimental design is limited in some way. Dealing with these situations so as to get an answer that means anything at all typically requires careful thought and/or advanced techniques (e.g. penalized regression: see the references to lasso and ridge regression in the Wikipedia article on linear regression).
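
For illustration only, here is a minimal sketch of what a penalized fit might look like with scikit-learn; the data shapes mirror the question (p=73 > n=61), but the data and the alpha values are arbitrary placeholders, not recommendations.

# Sketch: penalized regression as one principled way to handle p > n.
# The alpha values are arbitrary; in practice they would be chosen by cross-validation.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 61, 73                        # more predictors than observations, as in the question
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sets many coefficients exactly to zero

print(ridge.coef_[:5])
print((lasso.coef_ != 0).sum(), "nonzero lasso coefficients")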

The most naive way to handle this situation is to throw it all into the standard linear algebra and hope that nothing breaks too badly, which is apparently what Python's statsmodels package does; from the pitfalls document (a sketch of the suggested manual check follows the quote):

  • Rank deficient matrices will not raise an error.
  • Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior.
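
Here is a minimal version of that check on synthetic data (one exactly collinear column is added deliberately):

# Sketch: statsmodels fits a rank-deficient design without raising, so the
# rank / condition number of the design matrix has to be inspected manually.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(61, 5))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # extra column that is exactly collinear
y = rng.normal(size=61)

print("rank:", np.linalg.matrix_rank(X), "of", X.shape[1], "columns")
print("condition number:", np.linalg.cond(X))

res = sm.OLS(y, X).fit()                      # no error is raised here
print(res.params)                             # pinv-based coefficients for all columns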

The next best thing (reasonable when there is a small degree of collinearity) is to pivot sensibly when doing the linear algebra, that is, rearrange the computational problem so that the collinear parts can be left out. That's what R does; in order to do so, the authors of the R code had to modify the standard LINPACK routines.
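
For a concrete feel of what pivoting looks like from the Python side, here is a small sketch using scipy.linalg.qr with pivoting=True. Note that this calls LAPACK's pivoted QR rather than R's modified LINPACK routine, so the pivot order (and hence which column would get dropped) need not match R's.

# Sketch: rank-revealing, column-pivoted QR via scipy (LAPACK, not R's dqrdc2).
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
X = rng.normal(size=(61, 4))
X = np.column_stack([X, X[:, 0] - X[:, 1]])   # a column that is exactly collinear

Q, R, piv = qr(X, mode="economic", pivoting=True)
print(piv)                                    # degenerate columns are pushed to the end
print(np.abs(np.diag(R)))                     # a near-zero diagonal entry reveals the rank deficiency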

The modified routine, dqrdc2, has this comment/explanation:

c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.

c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.

This code (and comment) have been in R's code base since 1998; I'd love to know who originally wrote it (based on comments further down in the code it seems to have been Ross Ihaka?), but am having trouble following the code's history back beyond a code reorganization in 1998. (A little more digging suggests that this code has been in R's code base essentially since the beginning of its recorded history, i.e. the file was added in SVN revision 2 1997-09-18 and not modified until much later.)

Martin Mächler recently (2016 Oct 25, here) added more information about this to ?qr, so that this information will actually be available in the documentation in the next release of R ...

If you know how to link compiled FORTRAN code with Python code (I don't), it would be pretty easy to compile src/appl/dqrdc2.f and translate the guts of lm.fit into Python: this is the core of lm.fit, minus error-checking and other processing ...

z <- .Call(C_Cdqrls, x, y, tol, FALSE)  ## pivoted QR fit (dqrdc2) in compiled code
coef <- z$coefficients
pivot <- z$pivot
r1 <- seq_len(z$rank)                   ## columns actually used in the fit, in pivoted order
dn <- colnames(x)
nmeffects <- c(dn[pivot[r1]], rep.int("", n - z$rank))
r2 <- if (z$rank < p)                   ## positions of columns dropped as collinear, if any
    (z$rank + 1L):p
else integer()
if (is.matrix(y)) { ## ...
} else {
    coef[r2] <- NA                      ## dropped coefficients are reported as NA
    if (z$pivoted)
        coef[pivot] <- coef             ## undo the pivoting
    names(coef) <- dn
    names(z$effects) <- nmeffects
}
z$coefficients <- coef
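
Short of compiling the Fortran, a rough Python approximation of the same logic might look like the sketch below. It is a sketch only: it uses scipy's pivoted QR (LAPACK) rather than dqrdc2, and the tolerance rule used to decide the rank is my own assumption, so it will not always drop the same columns that R would.

# Sketch: an approximate Python analogue of the lm.fit excerpt above.
# Uses scipy's column-pivoted QR instead of R's modified dqrdc2; the
# tolerance rule below is an assumption, not a translation of the Fortran.
import numpy as np
from scipy.linalg import qr, solve_triangular

def lm_fit_pivoted(X, y, tol=1e-7):
    Q, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag[0]))     # declare rank where R's diagonal collapses
    coef = np.full(X.shape[1], np.nan)           # dropped columns get NaN, like R's NA
    coef_used = solve_triangular(R[:rank, :rank], Q[:, :rank].T @ y)
    coef[piv[:rank]] = coef_used                 # undo the pivoting, as in coef[pivot] <- coef
    return coef, rank, piv

# Usage: duplicate a column and one of the pair comes back as NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(61, 4))
X = np.column_stack([X, X[:, 0]])
y = rng.normal(size=61)
print(lm_fit_pivoted(X, y))
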
A second answer, from Josef:

As a complement to Ben Bolker's answer.

The main problem is what a statistical package should do with ill-posed problems. As far as I have seen, there are large differences across packages in how singular and almost-singular design matrices are handled. The only fully deterministic approach is for the user to choose a variable-selection or penalization algorithm explicitly.

If the rank can be clearly identified, then the outcome is still deterministic but varies with the chosen singularity policy. Stata and R drop variables: Stata drops them in the sequence in which the variables are listed, i.e. it drops the last collinear variable; I don't know which variables R drops. statsmodels handles the variables symmetrically and discards small singular values by using a generalized inverse, pinv, based on the singular value decomposition. This corresponds to a tiny penalization, as in PCA/reduced-rank regression or ridge regression.
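
The difference is easy to see on a toy example with an exactly duplicated column: a pinv/SVD-based fit splits the weight evenly between the two copies, whereas R's pivoted QR would keep one column and report the other coefficient as NA. A sketch with made-up data:

# Sketch: symmetric, pinv-based handling of an exactly duplicated predictor.
# The minimum-norm solution splits the true coefficient (2.0) equally between
# the two copies; R would instead keep one column and mark the other NA.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=61)
y = 2.0 * x + rng.normal(scale=0.1, size=61)
X = np.column_stack([x, x])                  # the same predictor twice

print(np.linalg.pinv(X) @ y)                 # roughly [1.0, 1.0]
print(sm.OLS(y, X).fit().params)             # statsmodels' default pinv fit: same split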

The result is then not deterministic, i.e. it depends on the linear algebra package and might not even be the same on different computers, if numerical noise affects the identification of the rank or of the collinear variables. Specifically, rank-revealing QR and pinv/SVD need a threshold below which the rank is identified as reduced. In statsmodels, the default threshold, taken from numpy, for the relative condition number is around 1e-15, so we get a regularized solution if singular values fall below that. But the threshold might be too low in some cases, and then the "non-singular" solution is dominated by numerical noise that cannot be replicated. I guess the same will be true for any rank-revealing QR or other purely numerical solution to the collinearity problem. (A small illustration of the threshold effect follows the links below.)
(See "Robustness issue of statsmodel Linear regression (ols) - Python", https://github.com/statsmodels/statsmodels/issues/2628, and the related question "statmodels in python package, How exactly duplicated features are handled?".)
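
A small sketch of that threshold effect; the 1e-13 noise scale is chosen arbitrarily so that the second column is almost, but not exactly, a duplicate:

# Sketch: the rank/rcond threshold decides between a stable, regularized
# solution and one dominated by numerical noise. The 1e-13 perturbation is
# an arbitrary choice that makes the second column nearly a duplicate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=61)
X = np.column_stack([x, x + 1e-13 * rng.normal(size=61)])   # almost duplicated column
y = 2.0 * x + rng.normal(scale=0.1, size=61)

print(np.linalg.matrix_rank(X))               # the reported rank depends on the tolerance used
print(np.linalg.pinv(X, rcond=1e-15) @ y)     # tiny singular value kept: huge, noise-driven coefficients
print(np.linalg.pinv(X, rcond=1e-10) @ y)     # tiny singular value dropped: stable split near [1, 1]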

About rank-revealing, pivoting QR:

It was not available in scipy when most of statsmodels.OLS was written. It has been available in scipy for some time now, as a pivoting keyword (https://github.com/scipy/scipy/pull/44), but it has not yet been added as an option to statsmodels.OLS.

I'm skeptical about it as a default solution because I guess, without ever having verified it, that numerical noise will affect which variables are pivoted. Then it would no longer be deterministic which variables get dropped. In my opinion, variable selection should be a conscious choice by the user and not be left to a purely numerical criterion.

(disclaimer: I'm a statsmodels maintainer)

edit:

The question uses scikit-learn in the example.

As far as I can see, scikit-learn uses a different LAPACK function to solve the linear regression problem, but it is also based on a singular value decomposition, like statsmodels. However, scikit-learn currently uses the default threshold of the corresponding scipy function, which is smaller than that of the numpy function that statsmodels uses (see e.g. "how does sklearn do Linear regression when p > n?").

So I expect that scikit-learn and statsmodels give the same results in the exactly singular case, but that the results will differ in some nearly singular cases.
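
A quick check of the exactly singular case (a sketch; whether the nearly singular cases agree will depend on the thresholds discussed above):

# Sketch: on an exactly singular design, scikit-learn's lstsq-based fit and
# statsmodels' pinv-based fit both return the same minimum-norm solution.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=61)
y = 2.0 * x + rng.normal(scale=0.1, size=61)
X = np.column_stack([x, x])                          # exactly collinear design

print(LinearRegression(fit_intercept=False).fit(X, y).coef_)
print(sm.OLS(y, X).fit().params)                     # both are close to [1.0, 1.0]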
