Using glmnet to predict a continuous variable in a dataset

隐身守侯 submitted on 2019-12-08 13:24:30

Question


I have this data set: wbh

I wanted to use the R package glmnet to determine which predictors would be useful in predicting fertility. However, I have been unable to do so, most likely due to not having a full understanding of the package. The fertility variable is SP.DYN.TFRT.IN. I want to see which predictors in the data set give the most predictive power for fertility. I wanted to use LASSO or ridge regression to shrink the number of coefficients, and I know this package can do that. I'm just having some trouble implementing it.

I apologize for not including any code snippets, but I am rather lost on how to code this.

Any advice is appreciated.

Thank you for reading


Answer 1:


Here is an example of how to run glmnet:

library(glmnet)
library(tidyverse)

df is the data set you provided.

select y variable:

y <- df$SP.DYN.TFRT.IN

select numerical variables:

df %>%
  select(-SP.DYN.TFRT.IN, -region, -country.code) %>%
  as.matrix() -> x

select factor variables and convert to dummy variables:

df %>%
  select(region, country.code) %>%
  model.matrix( ~ .-1, .) -> x_train
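To see what the dummy-variable step produces, here is a minimal sketch on a made-up data frame. The names toy and dummies and the factor values are assumptions for illustration, not part of the data set in the question.

```r
# Toy data frame standing in for the factor columns (values are made up)
toy <- data.frame(
  region = factor(c("Asia", "Europe", "Asia", "Africa")),
  income = factor(c("low", "high", "high", "low"))
)

# "~ . - 1" removes the intercept, so the first factor gets one indicator
# column per level; later factors still drop their reference level
dummies <- model.matrix(~ . - 1, data = toy)

colnames(dummies)
dim(dummies)
```

The result is a plain numeric matrix, which is the input type glmnet expects, so it can be combined with the numeric predictors via cbind().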

run the model(s). Several parameters here can be tweaked, so I suggest checking the documentation. Here I just run 5-fold cross-validation to determine the best lambda:

cv_fit <- cv.glmnet(x, y, nfolds = 5) #just with numeric variables

cv_fit_2 <- cv.glmnet(cbind(x ,x_train), y, nfolds = 5) #both factor and numeric variables

par(mfrow = c(2,1))
plot(cv_fit)
plot(cv_fit_2)
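The question mentions both LASSO and ridge regression. In glmnet the choice is made with the alpha argument: alpha = 1 (the default) gives the LASSO and alpha = 0 gives ridge. The sketch below uses simulated data, so x_sim, y_sim and the true coefficients are assumptions made purely for illustration.

```r
library(glmnet)

set.seed(1)
n <- 100
p <- 10
# Simulated predictors; only v1 and v2 actually influence the response
x_sim <- matrix(rnorm(n * p), n, p)
colnames(x_sim) <- paste0("v", 1:p)
y_sim <- 2 * x_sim[, 1] - 1.5 * x_sim[, 2] + rnorm(n)

# alpha = 1: LASSO (L1 penalty), can set coefficients exactly to zero
lasso_fit <- cv.glmnet(x_sim, y_sim, alpha = 1, nfolds = 5)
# alpha = 0: ridge (L2 penalty), shrinks coefficients but keeps them all
ridge_fit <- cv.glmnet(x_sim, y_sim, alpha = 0, nfolds = 5)

lasso_coef <- coef(lasso_fit, s = "lambda.min")
ridge_coef <- coef(ridge_fit, s = "lambda.min")

# Count non-zero coefficients, excluding the intercept in row 1
sum(lasso_coef[-1, 1] != 0)
sum(ridge_coef[-1, 1] != 0)
```

Because ridge does not zero out coefficients, the LASSO (or an elastic net with alpha between 0 and 1) is usually the better fit when the goal is to pick out a subset of predictors.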

best lambda:

cv_fit$lambda[which.min(cv_fit$cvm)]

coefficients at best lambda:

coef(cv_fit, s = cv_fit$lambda[which.min(cv_fit$cvm)])

equivalent to:

coef(cv_fit, s = "lambda.min")

After running coef(cv_fit, s = "lambda.min"), every feature shown with a . in the resulting table has a coefficient of exactly zero and is dropped from the model. This corresponds to the lambda marked by the left vertical dashed line on the plots.
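Rather than reading the retained predictors off the printed table, they can be pulled out programmatically. This is a minimal sketch on simulated data; x_sim, y_sim, fit and selected are illustrative names, and "lambda.1se" is used here only as the more conservative alternative to "lambda.min" that cv.glmnet also reports.

```r
library(glmnet)

set.seed(42)
n <- 100
p <- 10
x_sim <- matrix(rnorm(n * p), n, p)
colnames(x_sim) <- paste0("v", 1:p)
# Only v1 and v2 carry signal in this made-up example
y_sim <- 2 * x_sim[, 1] - 1.5 * x_sim[, 2] + rnorm(n)

fit <- cv.glmnet(x_sim, y_sim, nfolds = 5)

# coef() returns a sparse one-column matrix; convert it to a dense matrix
cf <- as.matrix(coef(fit, s = "lambda.1se"))

# Keep the names of predictors with non-zero coefficients, minus the intercept
selected <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
selected

# Retained predictors ordered by absolute coefficient size
sort(abs(cf[selected, 1]), decreasing = TRUE)
```

Coefficients are reported on the original scale of each predictor, so their sizes are only directly comparable when the predictors share the same units.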
I suggest reading the linked documentation. Elastic nets are quite easy to grasp if you know a bit of linear regression, and the package is quite intuitive. I also suggest reading ISLR, at least the part on L1 / L2 regularization, and these videos: 1, 2, 3, 4, 5, 6. The first three are about estimating model performance via test error, and the last three are about the question at hand. This one shows how to implement these models in R. By the way, the people in the videos invented LASSO and made glmnet.

Also check the glmnetUtils library, which provides a formula interface and other nice things like built-in mixing parameter (alpha) selection. Here is the vignette.
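If installing another package is not an option, a few alpha values can also be compared with glmnet alone. The sketch below is one way to do it, not the glmnetUtils implementation: it reuses the same fold assignment (foldid) for every alpha so the cross-validated errors are comparable. The data and the alpha grid are assumptions for illustration.

```r
library(glmnet)

set.seed(7)
n <- 120
p <- 8
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- x_sim[, 1] - x_sim[, 2] + rnorm(n)

# Assign each observation to one of 5 folds once, and reuse it for every alpha
foldid <- sample(rep(1:5, length.out = n))

alphas <- c(0, 0.5, 1)  # ridge, elastic net, LASSO
cv_err <- sapply(alphas, function(a) {
  fit <- cv.glmnet(x_sim, y_sim, alpha = a, foldid = foldid)
  min(fit$cvm)  # lowest mean cross-validated error for this alpha
})

# Alpha with the lowest cross-validated error on this grid
best_alpha <- alphas[which.min(cv_err)]
data.frame(alpha = alphas, cv_error = cv_err)
```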



Source: https://stackoverflow.com/questions/47626830/using-glmnet-to-predict-a-continuous-variable-in-a-dataset
