Question
I was trying to match the linear regression results from R with those from Python, specifically the coefficient for each independent variable; the code is below.
The data is uploaded here:
https://www.dropbox.com/s/oowe4irm9332s78/X.csv?dl=0
https://www.dropbox.com/s/79scp54unzlbwyk/Y.csv?dl=0
R code:
# set pathname to the directory containing the data files, e.g. pathname <- " "
X <- read.csv(file.path(pathname,"X.csv"),stringsAsFactors = F)
Y <- read.csv(file.path(pathname,"Y.csv"),stringsAsFactors = F)
d1 = Y$d1
reg_data <- cbind(d1, X)
head(reg_data)
reg_model <- lm(d1 ~ .,data=reg_data)
names(which(is.na(reg_model$coefficients)))
reg_model$coefficients
R Result
> summary(reg_model)
Call:
lm(formula = d1 ~ ., data = reg_data)
Residuals:
ALL 60 residuals are 0: no residual degrees of freedom!
Coefficients: (14 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.37752 NA NA NA
v1 1.30214 NA NA NA
v2 -2.93118 NA NA NA
v3 7.58902 NA NA NA
v4 11.88570 NA NA NA
v5 1.60622 NA NA NA
v6 3.71528 NA NA NA
v7 -9.34627 NA NA NA
v8 -3.84694 NA NA NA
v9 -2.51332 NA NA NA
v10 4.22403 NA NA NA
v11 -9.70126 NA NA NA
v12 NA NA NA NA
v13 4.67276 NA NA NA
v14 -6.57924 NA NA NA
v15 -3.68065 NA NA NA
v16 5.25168 NA NA NA
v17 14.60444 NA NA NA
v18 16.00679 NA NA NA
v19 24.79622 NA NA NA
v20 13.85774 NA NA NA
v21 2.16022 NA NA NA
v22 -36.65361 NA NA NA
v23 2.26554 NA NA NA
v24 NA NA NA NA
v25 NA NA NA NA
v26 7.00981 NA NA NA
v27 0.88904 NA NA NA
v28 0.34400 NA NA NA
v29 -5.27597 NA NA NA
v30 5.21034 NA NA NA
v31 6.79640 NA NA NA
v32 2.96346 NA NA NA
v33 -1.52702 NA NA NA
v34 -2.74632 NA NA NA
v35 -2.36952 NA NA NA
v36 -7.76547 NA NA NA
v37 2.19630 NA NA NA
v38 1.63336 NA NA NA
v39 0.69485 NA NA NA
v40 0.37379 NA NA NA
v41 -0.09107 NA NA NA
v42 2.06569 NA NA NA
v43 1.57505 NA NA NA
v44 2.70535 NA NA NA
v45 1.17634 NA NA NA
v46 -10.51141 NA NA NA
v47 -1.15060 NA NA NA
v48 2.87353 NA NA NA
v49 3.37740 NA NA NA
v50 -5.89816 NA NA NA
v51 0.85851 NA NA NA
v52 3.73929 NA NA NA
v53 4.93265 NA NA NA
v54 3.45650 NA NA NA
v55 0.12382 NA NA NA
v56 -0.21171 NA NA NA
v57 4.37199 NA NA NA
v58 3.21456 NA NA NA
v59 0.09012 NA NA NA
v60 -0.85414 NA NA NA
v61 -3.29856 NA NA NA
v62 4.38842 NA NA NA
v63 NA NA NA NA
v64 NA NA NA NA
v65 NA NA NA NA
v66 NA NA NA NA
v67 NA NA NA NA
v68 NA NA NA NA
v69 NA NA NA NA
v70 NA NA NA NA
v71 NA NA NA NA
v72 NA NA NA NA
v73 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 59 and 0 DF, p-value: NA
Python Code:
import pandas as pd
from sklearn.linear_model import LinearRegression

# pathname = " "  # directory containing the CSV files, as in the R code
Y = pd.read_csv(pathname + "Y.csv")
X = pd.read_csv(pathname + "X.csv")
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lr.fit(X, Y['d1'])
list(zip(lr.coef_, X))
lr.intercept_
Python Result:
intercept = 29.396033164254106
[(-2.4463986167806304, 'v1'),
(-1.6293010275307021, 'v2'),
(0.89089949009506508, 'v3'),
(-3.1021251646895251, 'v4'),
(-1.7707078771936109, 'v5'),
(-2.0474705122225636, 'v6'),
(-1.5537181337496202, 'v7'),
(-1.6391241229716156, 'v8'),
(-1.2981646048517046, 'v9'),
(0.89221826294889328, 'v10'),
(-0.56694104645951571, 'v11'),
(2.042810365310288e-14, 'v12'),
(-2.0312478672439052, 'v13'),
(-1.5617121392788413, 'v14'),
(0.4583365939498274, 'v15'),
(0.8840538748922967, 'v16'),
(-5.5952681002058871, 'v17'),
(2.4937042448512892, 'v18'),
(0.45806845189176543, 'v19'),
(-1.1648810657830406, 'v20'),
(-1.7800004329275585, 'v21'),
(-5.0132817522704816, 'v22'),
(3.6862778096189266, 'v23'),
(2.7533531010703882e-14, 'v24'),
(1.2150003225741557e-14, 'v25'),
(0.94669823515018103, 'v26'),
(-0.3082823207975679, 'v27'),
(0.53619247380957358, 'v28'),
(-1.1339902793546781, 'v29'),
(1.9657159583080186, 'v30'),
(-0.63200501460653324, 'v31'),
(1.4741013580918978, 'v32'),
(-2.4448418291953313, 'v33'),
(-2.0787115960875036, 'v34'),
(0.22492914212063603, 'v35'),
(-0.75136276693004922, 'v36'),
(1.2838658951186761, 'v37'),
(0.5816277993227944, 'v38'),
(-0.11270569554555088, 'v39'),
(-0.13430982360936233, 'v40'),
(-3.3189296496897662, 'v41'),
(-0.452575588270415, 'v42'),
(6.1329755709937519, 'v43'),
(0.61559185634634817, 'v44'),
(-1.206315459828555, 'v45'),
(-3.7452010299772009, 'v46'),
(-1.1472174665136678, 'v47'),
(2.8960489381172652, 'v48'),
(0.0090220136972478659, 'v49'),
(-5.264918363314754, 'v50'),
(1.2194758337662015, 'v51'),
(2.78655271320092, 'v52'),
(3.106513852668896, 'v53'),
(3.5181252502607929, 'v54'),
(-0.34426523770507278, 'v55'),
(-0.48792823932479878, 'v56'),
(0.12284460490031779, 'v57'),
(1.6860388628044991, 'v58'),
(1.2823067194737174, 'v59'),
(2.8352263554153665, 'v60'),
(-1.304336378501032, 'v61'),
(0.55226132316435139, 'v62'),
(1.5416988124754771, 'v63'),
(-0.2605804175310813, 'v64'),
(1.2489066081702334, 'v65'),
(-0.44469553013696161, 'v66'),
(-1.4102990055550157, 'v67'),
(3.8150423259022639, 'v68'),
(0.12039684410168072, 'v69'),
(-1.340699466779357, 'v70'),
(1.7066389124439212, 'v71'),
(0.50470752944860442, 'v72'),
(1.0024872633969766, 'v73')]
But the coefficients do not match. Please help.
Note: they did match for the example below:
http://davidcoallier.com/blog/linear-regression-from-r-to-python/
Answer 1:
tl;dr if you want to replicate R's strategy in Python you're probably going to have to implement it yourself, as R does some clever stuff that's not widely available elsewhere.
For reference (since it has been mentioned so far only in comments), this is an ill-posed/rank-deficient fit, which will always happen when there are more predictor variables than observations (p > n: in this case p = 73, n = 61), and often when there are many categorical predictors and/or the experimental design is limited in some way. Dealing with these situations so as to get an answer that means anything at all typically requires careful thought and/or advanced techniques (e.g. penalized regression: see the references to lasso and ridge regression in the Wikipedia article on linear regression).
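For concreteness, a minimal sketch of one such penalized approach, using scikit-learn's Ridge on the same data as in the question (pathname and the column name 'd1' are taken from the question; alpha=1.0 is an arbitrary illustration and would normally be chosen by cross-validation, e.g. with RidgeCV):
# penalized regression: a hedged sketch, not a drop-in replacement for lm()
import pandas as pd
from sklearn.linear_model import Ridge

X = pd.read_csv(pathname + "X.csv")            # pathname as defined in the question
y = pd.read_csv(pathname + "Y.csv")['d1']

ridge = Ridge(alpha=1.0, fit_intercept=True)   # alpha chosen arbitrarily for illustration
ridge.fit(X, y)
ridge.intercept_, ridge.coef_                  # penalized, hence uniquely determined, coefficients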
The most naive way to handle this situation is to throw it all into the standard linear algebra and hope that nothing breaks too badly, which is apparently what Python's statsmodels package does; from the pitfalls document:
- Rank deficient matrices will not raise an error.
- Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior.
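A minimal sketch of the kind of manual check the pitfalls document refers to, assuming X is loaded as in the question and an intercept column is appended to mirror fit_intercept=True (numpy's matrix_rank and cond are used here as one convenient choice; statsmodels itself may compute these quantities differently):
import numpy as np
import pandas as pd

X = pd.read_csv(pathname + "X.csv")           # pathname as defined in the question
A = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept column

print("shape:", A.shape)                      # here more columns than rows (p + 1 > n)
print("rank:", np.linalg.matrix_rank(A))      # less than A.shape[1] signals rank deficiency
print("condition number:", np.linalg.cond(A))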
The next best thing (reasonable when there is a small degree of collinearity) is to pivot sensibly when doing the linear algebra, that is, rearrange the computational problem so that the collinear parts can be left out. That's what R does; in order to do so, the authors of the R code had to modify the standard LINPACK routines.
The underlying code has this comment/explanation:
c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.
c
c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.
This code (and comment) have been in R's code base since 1998; I'd love to know who originally wrote it (based on comments further down in the code it seems to have been Ross Ihaka?), but am having trouble following the code's history back beyond a code reorganization in 1998. (A little more digging suggests that this code has been in R's code base essentially since the beginning of its recorded history, i.e. the file was added in SVN revision 2 1997-09-18 and not modified until much later.)
Martin Mächler recently (25 Oct 2016, here) added more information about this to ?qr, so that the information will actually be available in the documentation in the next release of R ...
If you know how to link compiled FORTRAN code with Python code (I don't), it would be pretty easy to compile src/appl/dqrdc2.f and translate the guts of lm.fit into Python. This is the core of lm.fit, minus error-checking and other processing ...
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
coef <- z$coefficients
pivot <- z$pivot
r1 <- seq_len(z$rank)
dn <- colnames(x)
nmeffects <- c(dn[pivot[r1]], rep.int("", n - z$rank))
r2 <- if (z$rank < p) (z$rank + 1L):p else integer()
if (is.matrix(y)) {  ## ...
} else {
    coef[r2] <- NA           ## coefficients beyond the detected rank are set to NA
    if (z$pivoted)
        coef[pivot] <- coef  ## undo the column pivoting
    names(coef) <- dn
    names(z$effects) <- nmeffects
}
z$coefficients <- coef
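Short of linking the FORTRAN, a rough pure-Python approximation of the same idea can be sketched with scipy's column-pivoted QR. This is only a sketch, not a faithful port of dqrdc2: scipy's pivoting reorders columns greedily by norm rather than using R's limited, tolerance-based strategy, so in borderline cases the detected rank and the columns that end up dropped may differ from what R chooses. The tolerance 1e-7 matches the default tol of lm.fit; pathname and the column name 'd1' are taken from the question.
import numpy as np
import pandas as pd
from scipy.linalg import qr

X = pd.read_csv(pathname + "X.csv")
y = pd.read_csv(pathname + "Y.csv")['d1'].values
A = np.column_stack([np.ones(len(X)), X.values])    # design matrix with intercept
names = ['(Intercept)'] + list(X.columns)

Q, R, piv = qr(A, mode='economic', pivoting=True)   # column-pivoted QR factorization
tol = 1e-7                                          # same default tolerance as R's lm.fit
rank = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))

coef = np.full(A.shape[1], np.nan)                  # NA for aliased columns, as R reports them
coef[piv[:rank]] = np.linalg.solve(R[:rank, :rank], Q[:, :rank].T @ y)  # undo the pivoting

dict(zip(names, coef))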
Answer 2:
As a complement to Ben Bolker's answer.
The main problem is what a statistical package should do with ill-posed problems. As far as I have seen, there are large differences across packages in how singular and almost-singular design matrices are handled. The only deterministic way is for the user to explicitly choose a variable selection or penalization algorithm.
If the rank can be clearly identified, then the outcome is still deterministic but varies by the chosen singularity policy. Stata and R drop variables: Stata drops them in the sequence in which the variables are listed, i.e. it drops the last collinear variables. I don't know which variables R drops. statsmodels handles the variables symmetrically and drops singular values by using a generalized inverse, pinv, based on the singular value decomposition. This corresponds to a tiny penalization, as in PCA reduced-rank regression or ridge regression.
The result is not deterministic if numerical noise affects the identification of the rank or of the collinear variables, i.e. it depends on the linear algebra package and might not even be the same on different computers.
Specifically, rank revealing QR and pinv/SVD need thresholds below which the rank is identified as reduced.
In statsmodels, the default threshold from numpy for the relative condition number is around 1e-15. So we get a regularized solution if singular values are smaller than that. But the threshold might be too low in some cases and the "non-singular" solution is dominated by numerical noise which cannot be replicated. I guess the same will be true for any rank revealing QR or other purely numerical solution to the collinearity problem.
(See "Robustness issue of statsmodel Linear regression (ols) - Python",
https://github.com/statsmodels/statsmodels/issues/2628,
and the related "statmodels in python package, How exactly duplicated features are handled?")
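To make that concrete, here is a minimal sketch of a pinv/SVD-style solution with an explicit threshold, assuming X and d1 are loaded as in the question (np.linalg.pinv with rcond is used purely to make the cutoff visible; it is not statsmodels' actual internal code path):
import numpy as np
import pandas as pd

X = pd.read_csv(pathname + "X.csv")
y = pd.read_csv(pathname + "Y.csv")['d1'].values
A = np.column_stack([np.ones(len(X)), X.values])  # intercept plus predictors

# singular values below rcond * (largest singular value) are treated as zero,
# so those directions are dropped and the minimum-norm solution is returned
beta = np.linalg.pinv(A, rcond=1e-15) @ y

beta[0], beta[1:]                                 # intercept, coefficients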
About rank-revealing, pivoting QR:
It was not available in scipy when most of statsmodels.OLS was written. It has since become available in scipy via the pivoting keyword (https://github.com/scipy/scipy/pull/44), but it has not yet been added as an option to statsmodels.OLS.
I'm skeptical about it as a default solution, because I guess, without ever having verified it, that numerical noise will affect which variables are pivoted. Then, it will not be deterministic anymore which variables will be dropped. In my opinion, variable selection should be a conscious choice by the user and not left to purely numerical choices.
(disclaimer: I'm a statsmodels maintainer)
edit:
The question uses scikit-learn in the example.
As far as I can see, scikit-learn uses a different LAPACK function to solve the linear regression problem, but it is also based on a singular value decomposition, like statsmodels. However, scikit-learn currently uses the default threshold of the corresponding scipy function, which is smaller than that of the numpy function that statsmodels uses.
e.g. see "how does sklearn do Linear regression when p > n?"
So I expect that scikit-learn and statsmodels give the same results in the singular case, but that the results will differ in some near-singular cases.
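As a rough illustration of that expectation, the sketch below compares scikit-learn's LinearRegression with a pinv-based minimum-norm fit on a small synthetic p > n problem. The centre-then-solve steps mimic how LinearRegression handles the intercept at the time of writing; the random data are purely illustrative, and on near-singular real data the two approaches could diverge because of the different thresholds discussed above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 20, 40                                   # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lr = LinearRegression(fit_intercept=True).fit(X, y)

# centre X and y, solve via an SVD-based pseudoinverse, then recover the intercept
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef = np.linalg.pinv(Xc) @ yc                  # minimum-norm least-squares solution
intercept = y.mean() - X.mean(axis=0) @ coef

np.allclose(lr.coef_, coef), np.allclose(lr.intercept_, intercept)   # both True here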
Source: https://stackoverflow.com/questions/40935624/differences-in-linear-regression-in-r-and-python