I'm interested in creating interaction terms in h2o.glm(), but I do not want to generate all pairwise interactions. For example, in the mtcars dataset, I want to interact 'mpg' with each of the other predictors ('cyl', 'hp', and 'disp'), but I don't want those predictors to interact with each other (so I don't want disp_hp or disp_cyl).
How should I best approach this problem using the (interactions = interactions_list) parameter in h2o.glm()?
Thank you
According to ?h2o.glm, the interactions= parameter takes:
A list of predictor column indices to interact. All pairwise combinations will be computed for the list.
You do not want all pairwise combinations, only specific ones.
Unfortunately, the R H2O API does not provide a formula interface. If it did, an arbitrary set of interactions could be specified programmatically, as in a vanilla R glm (see footnote 1).
Option 1: Use beta_constraints
One solution is to include all pairwise combinations in the model and then suppress those you do not want by setting the betas equal to 0.
According to the glm docs, beta_constraints= serves to:
Specify a dataset to use beta constraints. The selected frame is used to constraint the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.
According to the H2O Glossary, the format for beta_constraints is:
A data.frame or H2OParsedData object with the columns [“names”, “lower_bounds”, “upper_bounds”, “beta_given”], where each row corresponds to a predictor in the GLM. “names” contains the predictor names, “lower_bounds” and “upper_bounds” are the lower and upper bounds of beta, and “beta_given” is some supplied starting values for beta.
Now we know how to fill out our beta_constraints data frame, except for how to format the interaction term names. The doc on interactions doesn't tell us what to expect.
So let's just run an example with interactions through H2O and see what the interactions get named.
library('h2o')
remoteH2O <- h2o.init(ip='xxx.xx.xx.xxx', startH2O=FALSE)
data(mtcars)
df1 <- as.h2o(mtcars, destination_frame = 'demo_mtcars')
target <- 'wt'
predictors <- c('mpg','cyl','hp','disp')
glm1 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0, # disable regularization, but your use case may vary
                standardize = FALSE, # we want to see the raw parameters, but your use case may vary
                interactions = predictors # create all interactions
                )
print(glm1) # output includes:
# Coefficients: glm coefficients
# names coefficients
# 1 Intercept 4.336269
# 2 mpg_cyl 0.019558
# 3 mpg_hp 0.000156
# ..
So it looks like the interaction terms are getting named like v1_v2.
So let's name all the interaction terms we want to suppress, using setdiff() against the terms we want to keep.
library(tidyr) # provides unite() and re-exports the %>% pipe
intx_terms_keep <- # see footnote 1 for explanation
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
  unite(intx, Var1, Var2, sep='_') %>% unlist()
intx_terms_suppress <- setdiff( # suppress all interactions minus those we wish to keep
  combn(predictors,2,FUN=paste,collapse='_'),
  intx_terms_keep
)
constraints <- data.frame(names=intx_terms_suppress,
                          lower_bounds=0,
                          upper_bounds=0,
                          beta_given=0)
glm2 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm2', # distinct id so the first model is not overwritten on the cluster
                lambda = 0,
                standardize = FALSE,
                interactions = predictors, # create all interactions
                beta_constraints = constraints
                )
print(glm2) # output includes:
# Coefficients: glm coefficients
# names coefficients
# 1 Intercept 3.405154
# 2 mpg_cyl -0.012740
# 3 mpg_hp -0.000250
# 4 mpg_disp 0.000066
# 5 cyl_hp 0.000000
# 6 cyl_disp 0.000000
# 7 hp_disp 0.000000
# 8 mpg -0.018981
# 9 cyl 0.168820
# 10 disp 0.004070
# 11 hp 0.000501
As you can see, only the desired interaction terms have non-zero coefficients. The rest are pinned to zero and so are effectively ignored. However, since they are still terms in the model, they may count towards degrees of freedom and may affect some of the metrics (e.g., adjusted R-squared).
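The name bookkeeping above can also be done without tidyr. The following is a minimal base-R sketch, assuming the v1_v2 naming pattern observed in the output above holds for every pair; intx_terms_all is a helper name introduced here for illustration.

```r
predictors <- c('mpg', 'cyl', 'hp', 'disp')

# interactions to keep: mpg paired with each of the other predictors
intx_terms_keep <- paste('mpg', setdiff(predictors, 'mpg'), sep = '_')

# every pairwise name, following the v1_v2 pattern seen in the glm output
intx_terms_all <- combn(predictors, 2, FUN = paste, collapse = '_')

# whatever is not kept gets pinned to zero
intx_terms_suppress <- setdiff(intx_terms_all, intx_terms_keep)

constraints <- data.frame(names = intx_terms_suppress,
                          lower_bounds = 0,
                          upper_bounds = 0,
                          beta_given = 0)
print(constraints)
```

Note that this relies on 'mpg' coming first in predictors: combn() pairs names in the order given, so with a different ordering the generated names (for example cyl_mpg) would no longer match the names in intx_terms_keep.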
Option 2: pre-create the interaction terms
As @Darren Cook mentioned, another solution would be to pre-create the interactions as variables in the training dataset.
This approach would ensure that the unwanted interactions never enter the model, so they neither count towards degrees of freedom nor affect your adjusted R-squared.
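A minimal sketch of Option 2, assuming all the variables involved are numeric (as they are in mtcars), so that each interaction is simply the elementwise product. The column names (mpg_cyl and so on) and the frame id 'demo_mtcars_intx' are illustrative choices, not something H2O requires. The upload and fit are left as comments because they need a running H2O cluster.

```r
# Build only the wanted interaction columns locally, before uploading to H2O.
data(mtcars)
df <- mtcars

partners <- c('cyl', 'hp', 'disp')   # variables to interact with mpg
for (p in partners) {
  # numeric-by-numeric interaction: elementwise product of the two columns
  df[[paste('mpg', p, sep = '_')]] <- df[['mpg']] * df[[p]]
}

intx_cols  <- paste('mpg', partners, sep = '_')
predictors <- c('mpg', partners, intx_cols)

# With a running cluster, the frame could then be uploaded and fit
# without the interactions= argument, for example:
#   df_h2o <- as.h2o(df, destination_frame = 'demo_mtcars_intx')
#   glm3 <- h2o.glm(x = predictors, y = 'wt', training_frame = df_h2o,
#                   lambda = 0, standardize = FALSE)
```

If any of the interacting variables were categorical, a plain product would not apply and the interaction columns would have to be built differently.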
[1] Alternative, non-H2O solution for the vanilla glm formula interface
In a vanilla R glm(), which allows the formula interface, I would use expand.grid to create a string of interaction terms and include it in the formula.
Pass expand.grid two vectors: you want to interact all terms in v1 with all terms in v2.
To use your example, you want to interact mpg with cyl, hp, and disp:
library(tidyr)
intx_term_string <-
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
  unite(intx, Var1, Var2, sep=':') %>% apply(2, paste, collapse='+')
This gives you a string of interaction terms like "mpg:cyl+mpg:hp+mpg:disp" that you can paste into a string of other predictors (possibly using paste-collapse) and convert with as.formula().
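The last step might look like the sketch below. It builds the interaction string with base-R paste() rather than the tidyr pipeline, and the names main_effects, rhs, fml, and fit are introduced here for illustration; 'wt' is used as the response to match the H2O example.

```r
# Assemble a formula with main effects plus only the mpg:x interactions.
data(mtcars)

main_effects <- c('mpg', 'cyl', 'hp', 'disp')

# base-R equivalent of the expand.grid/unite pipeline for a single left-hand term
intx_term_string <- paste('mpg', c('cyl', 'hp', 'disp'), sep = ':', collapse = '+')

# paste-collapse the main effects and the interaction string into one right-hand side
rhs <- paste(c(main_effects, intx_term_string), collapse = '+')
fml <- as.formula(paste('wt ~', rhs))

fit <- glm(fml, data = mtcars)   # gaussian family by default
print(names(coef(fit)))
```

Because only the mpg:x terms are written into the formula, no cyl:hp, cyl:disp, or hp:disp coefficients are estimated at all.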
Source: https://stackoverflow.com/questions/45426642/h2o-glm-interact-only-certain-predictors