Question
I expect LASSO with no penalization ($\lambda=0$) to yield the same (or very similar) coefficient estimates as an OLS fit. However, I get different coefficient estimates in R when I put the same data (x, y) into
glmnet(x, y, alpha=1, lambda=0)
for the LASSO fit with no penalization and
lm(y ~ x)
for the OLS fit.
Why is that?
Answer 1:
You're using the function wrong. The x should be the model matrix, not the raw predictor vector. When you do that, you get the exact same results:
library(glmnet)
set.seed(1)  # for reproducibility
x <- rnorm(500)
y <- rnorm(500)
# Fit OLS, then reuse its model matrix for glmnet
mod1 <- lm(y ~ x)
xmm <- model.matrix(mod1)
mod2 <- glmnet(xmm, y, alpha = 1, lambda = 0)
coef(mod1)
coef(mod2)
Answer 2:
I ran the following code with the "prostate" example dataset from Hastie's book:
library(glmnet)
# yy holds the prostate data; lpsa (column 9) is the response
out.lin1 <- lm(lpsa ~ ., data = yy)
coef(out.lin1)
out.lin2 <- glmnet(as.matrix(yy[, -9]), yy$lpsa, family = "gaussian", lambda = 0, standardize = TRUE)
coefficients(out.lin2)
and the resulting coefficients are similar. Even when we use the standardize option, the coefficients returned by glmnet() are reported in the original units of the input variables. Please check that you are using the "gaussian" family.
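As a quick check of that point, here is a minimal sketch (with simulated data rather than the prostate set; the variable names are illustrative) showing that with lambda = 0 both standardize settings return coefficients on the original scale and agree with OLS:
library(glmnet)
set.seed(1)
# Two predictors on very different scales
x <- cbind(a = rnorm(200), b = 100 * rnorm(200))
y <- 1 + 2 * x[, "a"] + 0.03 * x[, "b"] + rnorm(200)
# With lambda = 0, both calls report coefficients in the original units of x
fit_std   <- glmnet(x, y, family = "gaussian", lambda = 0, standardize = TRUE)
fit_nostd <- glmnet(x, y, family = "gaussian", lambda = 0, standardize = FALSE)
cbind(std   = as.numeric(coef(fit_std)),
      nostd = as.numeric(coef(fit_nostd)),
      ols   = as.numeric(coef(lm(y ~ x))))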
Answer 3:
I had the same problem, asked around to no avail, then emailed the package maintainer (Trevor Hastie), who gave the answer. The problem occurs when the series are highly correlated. The solution is to decrease the threshold in the glmnet() function call (rather than via glmnet.control()). The code below uses the built-in dataset EuStockMarkets and applies a VAR with lambda=0. For XSMI, the OLS coefficient is below 1, the default glmnet coefficient is above 1 with a difference of about 0.03, and the glmnet coefficient with thresh=1e-14 is very close to the OLS coefficient (a difference of 1.8e-7).
library(glmnet)
# Use built-in panel data with integrated series
data("EuStockMarkets")
selected_market <- 2
# Take logs for good measure
EuStockMarkets <- log(EuStockMarkets)
# Get dimensions
num_entities <- dim(EuStockMarkets)[2]
num_observations <- dim(EuStockMarkets)[1]
# Build the response with the most recent observations at the top
Y <- as.matrix(EuStockMarkets[num_observations:2, selected_market])
X <- as.matrix(EuStockMarkets[(num_observations - 1):1, ])
# Run OLS, which adds an intercept by default
ols <- lm(Y ~ X)
ols_coef <- coef(ols)
# run glmnet with lambda = 0
fit <- glmnet(y = Y, x = X, lambda = 0)
lasso_coef <- coef(fit)
# run again, but with a stricter threshold
fit_threshold <- glmnet(y = Y, x = X, lambda = 0, thresh = 1e-14)
lasso_threshold_coef <- coef(fit_threshold)
# build a dataframe to compare the two approaches
comparison <- data.frame(ols = ols_coef,
lasso = lasso_coef[1:length(lasso_coef)],
lasso_threshold = lasso_threshold_coef[1:length(lasso_threshold_coef)]
)
comparison$difference <- comparison$ols - comparison$lasso
comparison$difference_threshold <- comparison$ols - comparison$lasso_threshold
# Show the two values for the autoregressive parameter and their difference
comparison[1 + selected_market, ]
R returns:
ols lasso lasso_threshold difference difference_threshold
XSMI 0.9951249 1.022945 0.9951248 -0.02782045 1.796699e-07
Answer 4:
From the glmnet help: Note also that for "gaussian", glmnet standardizes y to have unit variance before computing its lambda sequence (and then unstandardizes the resulting coefficients); if you wish to reproduce/compare results with other software, best to supply a standardized y.
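As a minimal sketch of that advice (simulated data and illustrative names, not from the original post): standardize y before the call and rescale the coefficients afterwards; with lambda = 0 the rescaled coefficients should line up with OLS on the original y.
library(glmnet)
set.seed(42)
x <- matrix(rnorm(200 * 3), ncol = 3)
y <- drop(x %*% c(1, -2, 0.5) + rnorm(200))
# Supply a standardized y, as the help page suggests
fit <- glmnet(x, y / sd(y), family = "gaussian", lambda = 0)
# Rescale the coefficients (including the intercept) back to the units of y
coef(fit) * sd(y)
# Compare with OLS on the original y
coef(lm(y ~ x))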
Source: https://stackoverflow.com/questions/38378118/lasso-with-lambda-0-and-ols-produce-different-results-in-r-glmnet