Question
I was trying to match the linear regression results from R with those from Python, specifically the coefficient for each independent variable; the code is below.
The data is uploaded here:
https://www.dropbox.com/s/oowe4irm9332s78/X.csv?dl=0
https://www.dropbox.com/s/79scp54unzlbwyk/Y.csv?dl=0
R code:
# set pathname to the directory containing the data files, e.g. pathname <- " "
X <- read.csv(file.path(pathname,"X.csv"),stringsAsFactors = F)
Y <- read.csv(file.path(pathname,"Y.csv"),stringsAsFactors = F)
d1 = Y$d1
reg_data <- cbind(d1, X)
head(reg_data)
reg_model <- lm(d1 ~ .,data=reg_data)
names(which(is.na(reg_model$coefficients)))
reg_model$coefficients
R Result
> summary(reg_model)
Call:
lm(formula = d1 ~ ., data = reg_data)
Residuals:
ALL 60 residuals are 0: no residual degrees of freedom!
Coefficients: (14 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.37752 NA NA NA
v1 1.30214 NA NA NA
v2 -2.93118 NA NA NA
v3 7.58902 NA NA NA
v4 11.88570 NA NA NA
v5 1.60622 NA NA NA
v6 3.71528 NA NA NA
v7 -9.34627 NA NA NA
v8 -3.84694 NA NA NA
v9 -2.51332 NA NA NA
v10 4.22403 NA NA NA
v11 -9.70126 NA NA NA
v12 NA NA NA NA
v13 4.67276 NA NA NA
v14 -6.57924 NA NA NA
v15 -3.68065 NA NA NA
v16 5.25168 NA NA NA
v17 14.60444 NA NA NA
v18 16.00679 NA NA NA
v19 24.79622 NA NA NA
v20 13.85774 NA NA NA
v21 2.16022 NA NA NA
v22 -36.65361 NA NA NA
v23 2.26554 NA NA NA
v24 NA NA NA NA
v25 NA NA NA NA
v26 7.00981 NA NA NA
v27 0.88904 NA NA NA
v28 0.34400 NA NA NA
v29 -5.27597 NA NA NA
v30 5.21034 NA NA NA
v31 6.79640 NA NA NA
v32 2.96346 NA NA NA
v33 -1.52702 NA NA NA
v34 -2.74632 NA NA NA
v35 -2.36952 NA NA NA
v36 -7.76547 NA NA NA
v37 2.19630 NA NA NA
v38 1.63336 NA NA NA
v39 0.69485 NA NA NA
v40 0.37379 NA NA NA
v41 -0.09107 NA NA NA
v42 2.06569 NA NA NA
v43 1.57505 NA NA NA
v44 2.70535 NA NA NA
v45 1.17634 NA NA NA
v46 -10.51141 NA NA NA
v47 -1.15060 NA NA NA
v48 2.87353 NA NA NA
v49 3.37740 NA NA NA
v50 -5.89816 NA NA NA
v51 0.85851 NA NA NA
v52 3.73929 NA NA NA
v53 4.93265 NA NA NA
v54 3.45650 NA NA NA
v55 0.12382 NA NA NA
v56 -0.21171 NA NA NA
v57 4.37199 NA NA NA
v58 3.21456 NA NA NA
v59 0.09012 NA NA NA
v60 -0.85414 NA NA NA
v61 -3.29856 NA NA NA
v62 4.38842 NA NA NA
v63 NA NA NA NA
v64 NA NA NA NA
v65 NA NA NA NA
v66 NA NA NA NA
v67 NA NA NA NA
v68 NA NA NA NA
v69 NA NA NA NA
v70 NA NA NA NA
v71 NA NA NA NA
v72 NA NA NA NA
v73 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 59 and 0 DF, p-value: NA
Python Code:
import pandas as pd
from sklearn.linear_model import LinearRegression

# pathname = " "  # directory containing the CSV files, as in the R code
Y = pd.read_csv(pathname + "Y.csv")
X = pd.read_csv(pathname + "X.csv")
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lr.fit(X, Y['d1'])
list(zip(lr.coef_, X))
lr.intercept_
Python Result:
intercept = 29.396033164254106
[(-2.4463986167806304, 'v1'),
(-1.6293010275307021, 'v2'),
(0.89089949009506508, 'v3'),
(-3.1021251646895251, 'v4'),
(-1.7707078771936109, 'v5'),
(-2.0474705122225636, 'v6'),
(-1.5537181337496202, 'v7'),
(-1.6391241229716156, 'v8'),
(-1.2981646048517046, 'v9'),
(0.89221826294889328, 'v10'),
(-0.56694104645951571, 'v11'),
(2.042810365310288e-14, 'v12'),
(-2.0312478672439052, 'v13'),
(-1.5617121392788413, 'v14'),
(0.4583365939498274, 'v15'),
(0.8840538748922967, 'v16'),
(-5.5952681002058871, 'v17'),
(2.4937042448512892, 'v18'),
(0.45806845189176543, 'v19'),
(-1.1648810657830406, 'v20'),
(-1.7800004329275585, 'v21'),
(-5.0132817522704816, 'v22'),
(3.6862778096189266, 'v23'),
(2.7533531010703882e-14, 'v24'),
(1.2150003225741557e-14, 'v25'),
(0.94669823515018103, 'v26'),
(-0.3082823207975679, 'v27'),
(0.53619247380957358, 'v28'),
(-1.1339902793546781, 'v29'),
(1.9657159583080186, 'v30'),
(-0.63200501460653324, 'v31'),
(1.4741013580918978, 'v32'),
(-2.4448418291953313, 'v33'),
(-2.0787115960875036, 'v34'),
(0.22492914212063603, 'v35'),
(-0.75136276693004922, 'v36'),
(1.2838658951186761, 'v37'),
(0.5816277993227944, 'v38'),
(-0.11270569554555088, 'v39'),
(-0.13430982360936233, 'v40'),
(-3.3189296496897662, 'v41'),
(-0.452575588270415, 'v42'),
(6.1329755709937519, 'v43'),
(0.61559185634634817, 'v44'),
(-1.206315459828555, 'v45'),
(-3.7452010299772009, 'v46'),
(-1.1472174665136678, 'v47'),
(2.8960489381172652, 'v48'),
(0.0090220136972478659, 'v49'),
(-5.264918363314754, 'v50'),
(1.2194758337662015, 'v51'),
(2.78655271320092, 'v52'),
(3.106513852668896, 'v53'),
(3.5181252502607929, 'v54'),
(-0.34426523770507278, 'v55'),
(-0.48792823932479878, 'v56'),
(0.12284460490031779, 'v57'),
(1.6860388628044991, 'v58'),
(1.2823067194737174, 'v59'),
(2.8352263554153665, 'v60'),
(-1.304336378501032, 'v61'),
(0.55226132316435139, 'v62'),
(1.5416988124754771, 'v63'),
(-0.2605804175310813, 'v64'),
(1.2489066081702334, 'v65'),
(-0.44469553013696161, 'v66'),
(-1.4102990055550157, 'v67'),
(3.8150423259022639, 'v68'),
(0.12039684410168072, 'v69'),
(-1.340699466779357, 'v70'),
(1.7066389124439212, 'v71'),
(0.50470752944860442, 'v72'),
(1.0024872633969766, 'v73')]
But the coefficients do not match. Please help.
Note: they did match for the example below:
http://davidcoallier.com/blog/linear-regression-from-r-to-python/
Answer 1:
tl;dr if you want to replicate R's strategy in Python you're probably going to have to implement it yourself, as R does some clever stuff that's not widely available elsewhere.
For reference (since it has been mentioned so far only in comments), this is an ill-posed/rank-deficient fit, which will always happen when there are more predictor variables than observations (p > n: in this case p = 73, n = 61), and often when there are many categorical predictors and/or the experimental design is limited in some way. Dealing with these situations so as to get an answer that means anything at all typically requires careful thought and/or advanced techniques (e.g. penalized regression: see the references to lasso and ridge regression in the Wikipedia article on linear regression).
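For concreteness, a minimal sketch of one such penalized approach, using scikit-learn's Ridge on the same data as in the question (pathname and the column name 'd1' are taken from the question; alpha=1.0 is an arbitrary illustration and would normally be chosen by cross-validation, e.g. with RidgeCV):
# penalized regression: a hedged sketch, not a drop-in replacement for lm()
import pandas as pd
from sklearn.linear_model import Ridge

X = pd.read_csv(pathname + "X.csv")            # pathname as defined in the question
y = pd.read_csv(pathname + "Y.csv")['d1']

ridge = Ridge(alpha=1.0, fit_intercept=True)   # alpha chosen arbitrarily for illustration
ridge.fit(X, y)
ridge.intercept_, ridge.coef_                  # penalized, hence uniquely determined, coefficients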
The most naive way to handle this situation is to throw it all into the standard linear algebra and hope that nothing breaks too badly, which is apparently what Python's statsmodels package does; from the pitfalls document:
- Rank deficient matrices will not raise an error.
- Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior.
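A minimal sketch of the kind of manual check the pitfalls document refers to, assuming X is loaded as in the question and an intercept column is appended to mirror fit_intercept=True (numpy's matrix_rank and cond are used here as one convenient choice; statsmodels itself may compute these quantities differently):
import numpy as np
import pandas as pd

X = pd.read_csv(pathname + "X.csv")           # pathname as defined in the question
A = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept column

print("shape:", A.shape)                      # here more columns than rows (p + 1 > n)
print("rank:", np.linalg.matrix_rank(A))      # less than A.shape[1] signals rank deficiency
print("condition number:", np.linalg.cond(A))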
The next best thing (reasonable when there is a small degree of collinearity) is to pivot sensibly when doing the linear algebra, that is, rearrange the computational problem so that the collinear parts can be left out. That's what R does; in order to do so, the authors of the R code had to modify the standard LINPACK routines.
The underlying code has this comment/explanation:
c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.
c
c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.
This code (and comment) have been in R's code base since 1998; I'd love to know who originally wrote it (based on comments further down in the code it seems to have been Ross Ihaka?), but am having trouble following the code's history back beyond a code reorganization in 1998. (A little more digging suggests that this code has been in R's code base essentially since the beginning of its recorded history, i.e. the file was added in SVN revision 2 1997-09-18 and not modified until much later.)
Martin Mächler recently (25 Oct 2016, here) added more information about this to ?qr, so that the information will actually be available in the documentation in the next release of R ...
If you know how to link compiled FORTRAN code with Python code (I don't), it would be pretty easy to compile src/appl/dqrdc2.f and translate the guts of lm.fit into Python. This is the core of lm.fit, minus error-checking and other processing ...
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
coef <- z$coefficients
pivot <- z$pivot
r1 <- seq_len(z$rank)
dn <- colnames(x)
nmeffects <- c(dn[pivot[r1]], rep.int("", n - z$rank))
r2 <- if (z$rank < p) (z$rank + 1L):p else integer()
if (is.matrix(y)) {  ## ...
} else {
    coef[r2] <- NA           ## coefficients beyond the detected rank are set to NA
    if (z$pivoted)
        coef[pivot] <- coef  ## undo the column pivoting
    names(coef) <- dn
    names(z$effects) <- nmeffects
}
z$coefficients <- coef
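Short of linking the FORTRAN, a rough pure-Python approximation of the same idea can be sketched with scipy's column-pivoted QR. This is only a sketch, not a faithful port of dqrdc2: scipy's pivoting reorders columns greedily by norm rather than using R's limited, tolerance-based strategy, so in borderline cases the detected rank and the columns that end up dropped may differ from what R chooses. The tolerance 1e-7 matches the default tol of lm.fit; pathname and the column name 'd1' are taken from the question.
import numpy as np
import pandas as pd
from scipy.linalg import qr

X = pd.read_csv(pathname + "X.csv")
y = pd.read_csv(pathname + "Y.csv")['d1'].values
A = np.column_stack([np.ones(len(X)), X.values])    # design matrix with intercept
names = ['(Intercept)'] + list(X.columns)

Q, R, piv = qr(A, mode='economic', pivoting=True)   # column-pivoted QR factorization
tol = 1e-7                                          # same default tolerance as R's lm.fit
rank = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))

coef = np.full(A.shape[1], np.nan)                  # NA for aliased columns, as R reports them
coef[piv[:rank]] = np.linalg.solve(R[:rank, :rank], Q[:, :rank].T @ y)  # undo the pivoting

dict(zip(names, coef))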
Answer 2:
As a complement to Ben Bolker's answer.
The main problem is what a statistical package should do with ill-posed problems. As far as I have seen, there are large differences across packages in how singular and almost-singular design matrices are handled. The only deterministic way is for the user to explicitly choose a variable selection or penalization algorithm.
If the rank can be clearly identified, then the outcome is still deterministic but varies by the chosen singularity policy. Stata and R drop variables: Stata drops them in the sequence in which the variables are listed, i.e. it drops the last collinear variables. I don't know which variables R drops. statsmodels handles the variables symmetrically and drops singular values by using a generalized inverse, pinv, based on the singular value decomposition. This corresponds to a tiny penalization, as in PCA reduced-rank regression or ridge regression.
The result is not deterministic if numerical noise affects the identification of the rank or of the collinear variables, i.e. it depends on the linear algebra package and might not even be the same on different computers.
Specifically, rank revealing QR and pinv/SVD need thresholds below which the rank is identified as reduced.
In statsmodels, the default threshold from numpy for the relative condition number is around 1e-15. So we get a regularized solution if singular values are smaller than that. But the threshold might be too low in some cases and the "non-singular" solution is dominated by numerical noise which cannot be replicated. I guess the same will be true for any rank revealing QR or other purely numerical solution to the collinearity problem.
(See "Robustness issue of statsmodel Linear regression (ols) - Python",
https://github.com/statsmodels/statsmodels/issues/2628,
and the related "statmodels in python package, How exactly duplicated features are handled?")
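To make that concrete, here is a minimal sketch of a pinv/SVD-style solution with an explicit threshold, assuming X and d1 are loaded as in the question (np.linalg.pinv with rcond is used purely to make the cutoff visible; it is not statsmodels' actual internal code path):
import numpy as np
import pandas as pd

X = pd.read_csv(pathname + "X.csv")
y = pd.read_csv(pathname + "Y.csv")['d1'].values
A = np.column_stack([np.ones(len(X)), X.values])  # intercept plus predictors

# singular values below rcond * (largest singular value) are treated as zero,
# so those directions are dropped and the minimum-norm solution is returned
beta = np.linalg.pinv(A, rcond=1e-15) @ y

beta[0], beta[1:]                                 # intercept, coefficients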
About rank-revealing, pivoting QR:
It was not available in scipy when most of statsmodels.OLS was written. It has since become available in scipy via the pivoting keyword (https://github.com/scipy/scipy/pull/44), but it has not yet been added as an option to statsmodels.OLS.
I'm skeptical about it as a default solution, because I guess, without ever having verified it, that numerical noise will affect which variables are pivoted. Then, it will not be deterministic anymore which variables will be dropped. In my opinion, variable selection should be a conscious choice by the user and not left to purely numerical choices.
(disclaimer: I'm a statsmodels maintainer)
edit:
The question uses scikit-learn in the example.
As far as I can see, scikit-learn uses a different LAPACK function to solve the linear regression problem, but it is also based on a singular value decomposition, like statsmodels. However, scikit-learn currently uses the default threshold of the corresponding scipy function, which is smaller than that of the numpy function that statsmodels uses.
e.g. see "how does sklearn do Linear regression when p > n?"
So I expect that scikit-learn and statsmodels give the same results in the singular case, but that the results will differ in some near-singular cases.
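As a rough illustration of that expectation, the sketch below compares scikit-learn's LinearRegression with a pinv-based minimum-norm fit on a small synthetic p > n problem. The centre-then-solve steps mimic how LinearRegression handles the intercept at the time of writing; the random data are purely illustrative, and on near-singular real data the two approaches could diverge because of the different thresholds discussed above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 20, 40                                   # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lr = LinearRegression(fit_intercept=True).fit(X, y)

# centre X and y, solve via an SVD-based pseudoinverse, then recover the intercept
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef = np.linalg.pinv(Xc) @ yc                  # minimum-norm least-squares solution
intercept = y.mean() - X.mean(axis=0) @ coef

np.allclose(lr.coef_, coef), np.allclose(lr.intercept_, intercept)   # both True here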
Source: https://stackoverflow.com/questions/40935624/differences-in-linear-regression-in-r-and-python