I would like to extract the coefficients of a model fitted with glmnet and build a SQL query from them. The function coef(cv.glmnet.fit)
yields a \'dgCMa
Here is a reproducible example that fits a binary (logistic) model using cv.glmnet. A plain glmnet model fit will also work. At the end of the example, I assemble the non-zero coefficients and their associated features into a data.frame called myResults:
library(glmnet)
X <- matrix(rnorm(100*10), 100, 10);
X[51:100, ] <- X[51:100, ] + 0.5; #artificially introduce difference in control cases
rownames(X) <- paste0("observation", 1:nrow(X));
colnames(X) <- paste0("feature", 1:ncol(X));
y <- factor( c(rep(1,50), rep(0,50)) ); #binary outcome class label
y
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [51] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
## Perform logistic model fit:
fit1 <- cv.glmnet(X, y, family="binomial", nfolds=5, type.measure="auc"); #with K-fold cross validation
# fit1 <- glmnet(X, y, family="binomial") #without cross validation also works
## Adapted from @Mehrad Mahmoudian:
myCoefs <- coef(fit1, s="lambda.min");
myCoefs[which(myCoefs != 0 ) ] #coefficients: intercept included
## [1] 1.4945869 -0.6907010 -0.7578129 -1.1451275 -0.7494350 -0.3418030 -0.8012926 -0.6597648 -0.5555719
## [10] -1.1269725 -0.4375461
myCoefs@Dimnames[[1]][which(myCoefs != 0 ) ] #feature names: intercept included
## [1] "(Intercept)" "feature1" "feature2" "feature3" "feature4" "feature5" "feature6"
## [8] "feature7" "feature8" "feature9" "feature10"
## Assemble into a data.frame
myResults <- data.frame(
features = myCoefs@Dimnames[[1]][ which(myCoefs != 0 ) ], #intercept included
coefs = myCoefs [ which(myCoefs != 0 ) ] #intercept included
)
myResults
## features coefs
## 1 (Intercept) 1.4945869
## 2 feature1 -0.6907010
## 3 feature2 -0.7578129
## 4 feature3 -1.1451275
## 5 feature4 -0.7494350
## 6 feature5 -0.3418030
## 7 feature6 -0.8012926
## 8 feature7 -0.6597648
## 9 feature8 -0.5555719
## 10 feature9 -1.1269725
## 11 feature10 -0.4375461
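Since the question asks for a SQL query, here is a minimal sketch of how a data.frame shaped like myResults could be turned into a scoring query. The helper name coefs_to_sql, the table name, and the output column name predicted_prob are hypothetical, and the sketch assumes that the feature names are valid column names in the target table and that the SQL dialect provides EXP(). Note that for family="binomial", glmnet models the probability of the second factor level.

```r
# Hypothetical helper: build a logistic scoring query from a features/coefs data.frame.
# Assumes feature names are valid SQL column names in `table_name`.
coefs_to_sql <- function(results, table_name = "my_table") {
  features <- as.character(results$features)
  is_intercept <- features == "(Intercept)"
  intercept <- sum(results$coefs[is_intercept])  # 0 if there is no intercept row
  # One "(coefficient * column)" term per non-intercept feature.
  terms <- sprintf("(%.10g * %s)", results$coefs[!is_intercept], features[!is_intercept])
  linear <- paste(c(sprintf("%.10g", intercept), terms), collapse = " + ")
  # Inverse logit of the linear predictor gives the predicted probability.
  sprintf("SELECT 1.0 / (1.0 + EXP(-(%s))) AS predicted_prob FROM %s;", linear, table_name)
}

# Toy input with made-up values, in the same layout as myResults.
example <- data.frame(
  features = c("(Intercept)", "feature1", "feature2"),
  coefs = c(1.5, -0.5, 2),
  stringsAsFactors = FALSE
)
sql <- coefs_to_sql(example, "observations")
cat(sql, "\n")
```

With the real fit, coefs_to_sql(myResults, "your_table") should produce the corresponding query; coefficients are printed with up to 10 significant digits, which you may want to adjust.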
Check the broom package. It has a tidy function that converts the output of various R objects (including glmnet) into data.frames.
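A minimal sketch of the broom route, assuming the broom and glmnet packages are installed. The simulated data and the object names (cvfit, all_coefs, at_min) are illustrative only. As far as I know, tidy() on a cv.glmnet object summarises the cross-validation curve rather than the coefficients, so the sketch tidies the underlying glmnet fit and then keeps the rows at lambda.min.

```r
library(glmnet)
library(broom)

# Simulated data, for illustration only.
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10,
            dimnames = list(NULL, paste0("feature", 1:10)))
X[51:100, ] <- X[51:100, ] + 0.5
y <- factor(c(rep(1, 50), rep(0, 50)))

cvfit <- cv.glmnet(X, y, family = "binomial", nfolds = 5)

# tidy() on the inner glmnet fit returns one row per non-zero coefficient
# and lambda value (columns include term, estimate and lambda).
all_coefs <- tidy(cvfit$glmnet.fit)

# Keep only the coefficients at the cross-validated lambda.min.
keep <- abs(all_coefs$lambda - cvfit$lambda.min) < 1e-12
at_min <- all_coefs[keep, c("term", "estimate")]
at_min
```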
Another approach is to apply coef() to the glmnet() object (your model). In the example below, the index [[1]] selects the first outcome class of a multinomial logistic regression; for other model types you may need to remove it.
coef_names_GLMnet <- coef(GLMnet, s = 0)[[1]]
row.names(coef_names_GLMnet)[coef_names_GLMnet@i+1]
The indexes passed to row.names() need to be incremented (+1) because the @i slot of the coef() object numbers the rows (the intercept and the data features) starting from 0, whereas R character vectors are indexed starting from 1.
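For a multinomial fit, coef() returns a list with one sparse column matrix per outcome class, so the same extraction can be repeated for every class instead of only [[1]]. The sketch below uses a hand-built list (class_coefs, with made-up class names and values) as a stand-in for that return value, so it runs with only the Matrix package.

```r
library(Matrix)

# Stand-in for coef(GLMnet, s = 0) on a multinomial fit:
# a named list with one sparse column matrix per class (values are made up).
make_col <- function(i, x) {
  sparseMatrix(i = i, j = rep(1, length(i)), x = x, dims = c(3, 1),
               dimnames = list(c("(Intercept)", "feature1", "feature2"), "1"))
}
class_coefs <- list(
  classA = make_col(c(1, 2), c(0.4, -1.2)),
  classB = make_col(c(1, 3), c(-0.4, 0.9))
)

# Extract the non-zero terms of every class and stack them into one data.frame.
per_class <- lapply(names(class_coefs), function(cl) {
  m <- class_coefs[[cl]]
  data.frame(class = cl,
             term = rownames(m)[m@i + 1],  # @i is zero-based, hence the +1
             estimate = m@x,
             stringsAsFactors = FALSE)
})
all_classes <- do.call(rbind, per_class)
all_classes
```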
UPDATE: The first two comments on my answer are both right. I have kept the old answer below only for posterity.
The following answer is short, it works and does not need any other package:
tmp_coeffs <- coef(cv.glmnet.fit, s = "lambda.min")
data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient = tmp_coeffs@x)
The reason for the +1 is that the @i slot stores zero-based row indices (0 is the intercept), whereas @Dimnames[[1]] is indexed starting at 1.
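To see the zero-based indexing in isolation, here is a small sketch using only the Matrix package. The object toy and its values are made up; it is a sparse column matrix of the same class (dgCMatrix) that coef() returns.

```r
library(Matrix)

# A 4 x 1 sparse column with non-zero entries in rows 1 and 3 (made-up values).
toy <- sparseMatrix(
  i = c(1, 3), j = c(1, 1), x = c(1.5, -0.7), dims = c(4, 1),
  dimnames = list(c("(Intercept)", "feature1", "feature2", "feature3"), "s1")
)

toy@i  # zero-based row positions of the stored entries: 0 2
toy@x  # the stored values: 1.5 -0.7

# Adding 1 converts the zero-based positions into ordinary R indexes.
nonzero <- data.frame(name = rownames(toy)[toy@i + 1],
                      coefficient = toy@x,
                      stringsAsFactors = FALSE)
nonzero
```

Without the +1, rownames(toy)[toy@i] would drop the intercept (index 0 selects nothing) and label the second value with the wrong feature.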
OLD ANSWER: (only kept for posterity) Try these lines:
The non zero coefficients:
coef(cv.glmnet.fit, s = "lambda.min")[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]
The features that are selected:
colnames(regression_data)[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]
Putting them together into a data.frame is then straightforward, but let me know if you want that part of the code as well.
# Requires tibble (rownames_to_column) and magrittr (%>%).
tidy_coef <- function(x){
  coef(x) %>%
    as.matrix %>% # Coerce from sparse matrix to regular matrix, keeping row names.
    data.frame %>% # Then to a data.frame.
    rownames_to_column %>% # Add row names as an explicit column.
    setNames(c("term","estimate"))
}
Without tibble:
tidy_coef2 <- function(x){
x <- coef(x)
data.frame(term=rownames(x),
estimate=matrix(x)[,1],
stringsAsFactors = FALSE)
}
The names should be accessible as dimnames(coef(cv.glmnet.fit))[[1]], so the following should put both the coefficient names and values into a data.frame:
data.frame(coef.name = dimnames(coef(GLMNET))[[1]], coef.value = matrix(coef(GLMNET)))