问题
Goal
- Create a LASSO model using MLR3
- Use nested CV with inner CV or bootstraps for hyperparameter (lambda) determination and outer CV for model performance evaluation (instead of doing just one test-train spit) and finding the standard deviation of the different LASSO regression coefficients amongst the different model instances.
- Do a prediction on a testing data set not available yet.
Issues
- I am unsure whether the nested CV approach as described is implemented correctly in my code below.
- I am unsure whether alpha is set correctly alpha = 1 only.
- I do not know how to access the LASSO lamda coefficients when using resampling in mlr3. (importance() in mlr3learners does not yet support LASSO)
- I don't know how to apply a possible model to the unavailable testing set in mlr3.
Code
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> .default = col_double()
#> )
#> See spec(...) for full column specifications.
# Add a factor to data
data$factor = as.factor(c(rep('a', 20), rep('b', 50), rep('c', 30), rep('a', 6), rep('c', 10), rep('b', 20)))
# Task creation
task =
TaskRegr$new(
id = 'office',
backend = data,
target = 'imdb_rating'
)
# Model creation
graph =
po('scale') %>>%
po('encode') %>>% # make factors numeric
# How to normalize predictors, leaving target unchanged?
lrn('regr.cv_glmnet', # 10-fold CV for inner loop. Is alpha permanently set to 1?
id = 'rp', alpha = 1, family = 'gaussian'
)
graph_learner = GraphLearner$new(graph)
# Execution (actual modeling)
result =
resample(
task,
graph_learner,
rsmp('cv', folds = 5) # 5-fold for outer CV
)
#> INFO [13:21:53.485] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 3/5)
#> INFO [13:21:54.937] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 2/5)
#> INFO [13:21:55.242] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 1/5)
#> INFO [13:21:55.500] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 4/5)
#> INFO [13:21:55.831] Applying learner 'scale.encode.regr.cv_glmnet' on task 'office' (iter 5/5)
# How to access results, i.e. lamda coefficients,
# and compare them (why no variable importance for glmnet)
# Access prediction
result$prediction()
#> <PredictionRegr> for 136 observations:
#> row_id truth response
#> 2 8.3 8.373798
#> 6 8.7 8.455151
#> 9 8.4 8.358964
#> ---
#> 116 9.7 8.457607
#> 119 8.2 8.130352
#> 128 7.8 8.224150
Created on 2020-06-11 by the reprex package (v0.3.0)
Edit 1 (LASSO coefficients)
According to a comment from missuse LASSO coefficients can be accessed through result$data$learner[[1]]$model$rp$model$glmnet.fit$beta
Additionally, I found that store_models = TRUE
needs to be set in result to store the model and in turn access the coefficients.
- despite setting alpha = 1, I optained multiple LASSO coefficients. I would like the 'best' LASSO coefficients (stemming from e. g. from lamda = lamda.min or lamda.1se). What do the different s1, s2, s3, ... mean? Are these different lamdas?
- The different coefficients indeed seem to stem from different lambda values denoted as s1, s2 , s3, ... (Numer is index.) I suppose, the 'best' coefficients can be accessed by first finding the indices of the 'best' lambda
index_lamda.1se = which(ft$lambda == ft$lambda.1se)[[1]]; index_lamda.min = which(ft$lambda == ft$lambda.min)[[1]]
and then finding the set of coefficients. A more concise approach to find the 'best' coefficients is given in the comments by missuse.
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
# Add a factor to data
data$factor = as.factor(c(rep('a', 20), rep('b', 50), rep('c', 30), rep('a', 6), rep('c', 10), rep('b', 20)))
# Task creation
task =
TaskRegr$new(
id = 'office',
backend = data,
target = 'imdb_rating'
)
# Model creation
graph =
po('scale') %>>%
po('encode') %>>% # make factors numeric
# How to normalize predictors, leaving target unchanged?
lrn('regr.cv_glmnet', # 10-fold CV for inner loop. Is alpha permanently set to 1?
id = 'rp', alpha = 1, family = 'gaussian'
)
graph$keep_results = TRUE
graph_learner = GraphLearner$new(graph)
# Execution (actual modeling)
result =
resample(
task,
graph_learner,
rsmp('cv', folds = 5), # 5-fold for outer CV
store_models = TRUE # Store model needed to acces coefficients
)
# LASSO coefficients
# Why more than one coefficient per predictor?
# What are s1, s2 etc.? Shouldn't 'lrn' fix alpha = 1?
# How to obtain the best coefficient (for lamda 1se or min) if multiple?
as.matrix(result$data$learner[[1]]$model$rp$model$glmnet.fit$beta)
#> s0 s1 s2 s3 s4 s5
#> andy 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> angela 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> b_j_novak 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> brent_forrester 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> darryl 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> dwight 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> episode 0 0.000000000 0.00000000 0.00000000 0.010297763 0.02170423
#> erin 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> gene_stupnitsky 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> greg_daniels 0 0.000000000 0.00000000 0.00000000 0.001845101 0.01309437
#> jan 0 0.000000000 0.00000000 0.00000000 0.005663699 0.01357832
#> jeffrey_blitz 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> jennifer_celotta 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> jim 0 0.006331732 0.01761548 0.02789682 0.036853510 0.04590513
#> justin_spitzer 0 0.000000000 0.00000000 0.00000000 0.000000000 0.00000000
#> [...]
#> s6 s7 s8 s9 s10
#> andy 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> angela 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> b_j_novak 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> brent_forrester 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> darryl 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> dwight 0.002554576 0.007006995 0.011336058 0.01526851 0.01887180
#> episode 0.031963475 0.040864492 0.047487987 0.05356482 0.05910066
#> erin 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> gene_stupnitsky 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> greg_daniels 0.023040791 0.031866343 0.040170917 0.04779004 0.05472702
#> jan 0.021030152 0.028094541 0.035062678 0.04143812 0.04725379
#> jeffrey_blitz 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> jennifer_celotta 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> jim 0.053013058 0.058503984 0.062897112 0.06683734 0.07041964
#> justin_spitzer 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> kelly 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> ken_kwapis 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> kevin 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> lee_eisenberg 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> michael 0.057190859 0.062963830 0.068766981 0.07394472 0.07865977
#> mindy_kaling 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> oscar 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> pam 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> paul_feig 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> paul_lieberstein 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> phyllis 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> randall_einhorn 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> ryan 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> season 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> toby 0.000000000 0.000000000 0.005637169 0.01202893 0.01785309
#> factor.a 0.000000000 -0.003390125 -0.022365768 -0.03947047 -0.05505681
#> factor.b 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> factor.c 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
#> s11 s12 s13 s14
#> andy 0.000000000 0.000000000 0.000000000 0.0000000000
#> angela 0.000000000 0.000000000 0.000000000 0.0000000000
#> b_j_novak 0.000000000 0.000000000 0.000000000 0.0000000000
#> brent_forrester 0.000000000 0.000000000 0.000000000 0.0000000000
#> darryl 0.000000000 0.000000000 0.000000000 0.0017042281
#> dwight 0.022170870 0.025326337 0.027880703 0.0303865693
#> episode 0.064126846 0.069018240 0.074399623 0.0794693480
#> [...]
Created on 2020-06-15 by the reprex package (v0.3.0)
Edit 2 (optional follow up question)
Nested CV provides discrepancy-evalutation amongst multiple models. The discrepancy can be expressed as an error (e.g. RMSE) obtained by the outer CV. While that error may be small, individual LASSO coefficients (importance of predictors) from the models (instanciated by the outer CV) may vary considerably.
- Does mlr3 provide functionality describing the consitancy in quantitative importance of predictor variables, i. e. RMSE of LASSO coefficients amongst models created by the outer CV? Or should a custom function be created, retrieving the LASSO coefficients using
result$data$learner[[i]]$model$rp$model$glmnet.fit$beta
(suggested by missuse) with i = 1, 2, 3, 4, 5 being the folds of the outer CV and then taking RMSE of the matching coefficients?
来源:https://stackoverflow.com/questions/62323004/how-to-access-and-compare-lasso-model-coefficients-with-mlr3-glmnet-learner