Question
I am a heavy R user and have recently started learning Python. I have a question about how statsmodels handles duplicated features. As I understand it, sm.GLM is the Python counterpart of glm in R, so I expect it to return the maximum likelihood estimates (MLE).
My question is: which algorithm does statsmodels employ to obtain the MLE? In particular, how does that algorithm handle duplicated features?
To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.
import statsmodels.api as sm
import pandas as pd
import numpy as np
def ilogit(eta):
    return 1.0 - 1.0/(np.exp(eta)+1)
## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0,1,Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value*cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]
Then fit the logistic regression as:
## fit logistic regression
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This returns:
The estimated coefficient is roughly in line with the true value (0.5). Then I create a duplicate column, x2, and fit the logistic regression model again. (glm in R would return NA for x2.)
cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This outputs:
Surprisingly, this works, and the coefficient estimates of x1 and x2 are exactly identical (0.1182). Since the previous fit gave a coefficient estimate of 0.2364 for x1, the estimate has been halved. Then I increase the number of duplicated features to 9 and fit the model:
## add x3, ..., x9 as further copies of x1 (9 identical columns in total)
for icol in range(3,10):
    cov["x"+str(icol)] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
As expected, the estimates for the duplicated variables are all the same (0.0263), and they appear to be 9 times smaller than the original estimate for x1 (0.2364).
I am surprised by this behaviour of the maximum likelihood estimates. Could you explain why this happens, and which algorithms statsmodels uses under the hood?
Answer 1:
The short answer:
In this case GLM uses the Moore-Penrose generalized inverse, pinv, which corresponds to a principal component regression in which components with zero eigenvalues are dropped. An eigenvalue counts as zero when it falls below the default threshold (rcond) in numpy.linalg.pinv.
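To see why the generalized inverse produces the halved (and then ninth-sized) estimates reported in the question, here is a short derivation. It assumes k exactly identical columns x and no other regressors, and uses the standard property that the pseudo-inverse solution is the minimum-norm one; the symbols s and \hat{s} are introduced here for illustration only.

```latex
% With k identical columns, the linear predictor depends on beta only through its sum s:
\eta = \sum_{j=1}^{k} \beta_j x = s\,x, \qquad s = \sum_{j=1}^{k} \beta_j .

% The likelihood therefore pins down only \hat{s}. Among all beta with that sum,
% the pseudo-inverse selects the one of minimum Euclidean norm:
\min_{\beta} \sum_{j=1}^{k} \beta_j^2
\quad \text{subject to} \quad \sum_{j=1}^{k} \beta_j = \hat{s}
\;\;\Longrightarrow\;\; \beta_j = \frac{\hat{s}}{k}, \quad j = 1, \dots, k .

% With \hat{s} = 0.2364 from the single-column fit:
k = 2:\ \beta_j = 0.2364 / 2 = 0.1182, \qquad
k = 9:\ \beta_j = 0.2364 / 9 \approx 0.0263 .
```

So the sum of the coefficients still equals the single-column MLE; only its split across the copies is determined by the minimum-norm rule.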
statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, in which case we see the behavior as above.
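The effect of the generalized inverse can be reproduced without statsmodels. The following is a minimal sketch using plain NumPy on a least-squares problem; the simulated data, the seed, and the variable names are assumptions made for illustration, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
y = 0.5 * x1 + rng.normal(scale=0.1, size=n)  # simulated response, illustration only

# Single-column design (no intercept): the usual least-squares slope.
X_single = x1[:, None]
beta_single = np.linalg.pinv(X_single) @ y

# Design with k identical copies of x1: X'X is singular, so the least-squares
# solution is not unique. pinv returns the minimum-norm solution.
k = 9
X_dup = np.repeat(X_single, k, axis=1)  # shape (n, k)
beta_dup = np.linalg.pinv(X_dup) @ y

print("single column:", beta_single)
print("k duplicates: ", beta_dup)
print("sum over duplicates:", beta_dup.sum())
```

Each duplicated coefficient comes out as the single-column estimate divided by k, and their sum recovers the single-column estimate, which is the same pattern as in the GLM fits above.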
The default optimization algorithm in GLM.fit is iteratively reweighted least squares (IRLS), which uses WLS and inherits the default behavior of WLS for singular design matrices.
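To make the IRLS connection concrete, here is a rough sketch of IRLS for a logit model in which every weighted least-squares step is solved with numpy.linalg.pinv. It is not the statsmodels implementation: the function name irls_logit, the zero starting values, the fixed iteration count, and the simulated data are all assumptions chosen to keep the example short.

```python
import numpy as np

def irls_logit(X, y, n_iter=25):
    """Logistic regression by IRLS; each step is a weighted
    least-squares solve via the pseudo-inverse (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # fitted probabilities
        w = mu * (1.0 - mu)                 # IRLS weights for the logit link
        z = eta + (y - mu) / w              # working response
        sw = np.sqrt(w)
        # WLS step: least squares on the sqrt(w)-scaled design and response.
        beta = np.linalg.pinv(sw[:, None] * X) @ (sw * z)
    return beta

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
p = 1.0 / (1.0 + np.exp(-0.5 * x1))
y = rng.binomial(1, p).astype(float)

beta_one = irls_logit(x1[:, None], y)               # single column
beta_two = irls_logit(np.column_stack([x1, x1]), y) # x1 duplicated as x2

print(beta_one)
print(beta_two)
```

Because every WLS step returns the minimum-norm solution, the two duplicated coefficients stay equal throughout the iterations and end up at half of the single-column estimate.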
The version in statsmodels master also has the option of using the standard scipy optimizers, in which case the behavior with singular or near-singular design matrices depends on the details of the optimization algorithm.
Source: https://stackoverflow.com/questions/37472963/statmodels-in-python-package-how-exactly-duplicated-features-are-handled