H2O GLM: interact only certain predictors

I'm interested in creating interaction terms in h2o.glm(), but I do not want to generate all pairwise interactions. For example, in the mtcars dataset I want to interact 'mpg' with each of the other predictors ('cyl', 'hp', and 'disp'), but I don't want those predictors to interact with each other (so I don't want disp_hp or disp_cyl).

How should I best approach this problem using the interactions = interactions_list parameter in h2o.glm()?

Thank you

According to ?h2o.glm the interactions= parameter takes:

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

You do not want all pairwise combinations, only specific ones.

Unfortunately, the R H2O API does not provide a formula interface. If it did, an arbitrary set of interactions could be specified programmatically, as in a vanilla R glm().¹

Option 1: Use `beta_constraints`

One solution is to include all pairwise combinations in the model and then suppress the ones you do not want by constraining their coefficients (betas) to 0.

According to the glm docs, beta_constraints= serves to:

Specify a dataset to use beta constraints. The selected frame is used to constraint the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

According to the H2O Glossary, the format for beta_constraints is:

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds of beta, and "beta_given" is some supplied starting values for beta.

Now we know how to fill out our beta_constraints data frame, except for how the interaction term names are formatted. The documentation for interactions= doesn't say what to expect, so let's run an example with interactions through H2O and see what the terms get named.

library('h2o')
remoteH2O <- h2o.init(ip='xxx.xx.xx.xxx', startH2O=FALSE)

data(mtcars)

df1 <- as.h2o(mtcars, destination_frame = 'demo_mtcars')

target <- 'wt'
predictors <- c('mpg','cyl','hp','disp')

glm1 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0, # disable regularization, but your use case may vary
                standardize = FALSE, # we want to see the raw parameters, but your use case may vary
                interactions = predictors # create all interactions
                )
print(glm1) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     4.336269
# 2    mpg_cyl     0.019558
# 3     mpg_hp     0.000156
# ..

It looks like the interaction terms are named like v1_v2.

Now we can list all the interaction terms we want to suppress, using setdiff() against the terms we want to keep.

library(tidyr)
intx_terms_keep <- # see footnote 1 for explanation
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep='_') %>% unlist()

intx_terms_suppress <- setdiff( # suppress all interactions minus those we wish to keep
                             combn(predictors,2,FUN=paste,collapse='_'), 
                             intx_terms_keep
                            )
constraints <- data.frame(names=intx_terms_suppress, 
                          lower_bounds=0, 
                          upper_bounds=0, 
                          beta_given=0)

glm2 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm_constrained', # distinct id so the glm1 model is not overwritten
                lambda = 0,
                standardize = FALSE, 
                interactions = predictors, # create all interactions
                beta_constraints = constraints
)
print(glm2) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     3.405154
# 2    mpg_cyl    -0.012740
# 3     mpg_hp    -0.000250
# 4   mpg_disp     0.000066
# 5     cyl_hp     0.000000
# 6   cyl_disp     0.000000
# 7    hp_disp     0.000000
# 8        mpg    -0.018981
# 9        cyl     0.168820
# 10      disp     0.004070
# 11        hp     0.000501

As you can see, only the desired interaction terms have non-zero coefficients; the rest are constrained to zero and are effectively ignored. However, since they are still terms in the model, they may count towards degrees of freedom and may affect some of the metrics (e.g., adjusted R-squared).

Option 2: pre-create the interaction terms

As @Darren Cook mentioned, another solution would be to pre-create the interactions as variables in the training dataset.

This approach would ensure that the unwanted interactions do not count towards degrees of freedom or affect your adjusted R-squared.
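A minimal sketch of this approach, assuming each interaction is the plain product of two numeric columns. The column names `mpg_cyl`, `mpg_hp`, and `mpg_disp` are chosen here for illustration, and the base-R `glm()` call is only a local sanity check that needs no cluster.

```r
data(mtcars)

target     <- 'wt'
predictors <- c('mpg', 'cyl', 'hp', 'disp')
partners   <- setdiff(predictors, 'mpg')  # columns to interact with mpg

# Add one product column per wanted interaction: mpg_cyl, mpg_hp, mpg_disp
train <- mtcars[, c(target, predictors)]
for (p in partners) {
  train[[paste('mpg', p, sep = '_')]] <- train[['mpg']] * train[[p]]
}

# Main effects plus only the hand-made interaction columns
x_cols <- setdiff(names(train), target)
print(x_cols)

# Local sanity check with base R; no H2O cluster required
fit <- glm(reformulate(x_cols, response = target), data = train)
print(coef(fit))
```

With H2O, the same frame could be uploaded with `as.h2o(train)` and passed to `h2o.glm()` with `x = x_cols` and no `interactions=` argument, so that only the three product columns enter the model.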

¹ Alternative, non-H2O solution for vanilla `glm` formula interface

In a vanilla R glm(), which has a formula interface, I would use expand.grid() to create a string of interaction terms and include it in the formula.

Pass expand.grid() two vectors, v1 and v2; every term in v1 is paired with every term in v2.

To use your example, you want to interact mpg with cyl, hp, and disp:

library(tidyr)
intx_term_string <- 
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep=':') %>% apply(2, paste, collapse='+')

This gives you a string of interaction terms like "mpg:cyl+mpg:hp+mpg:disp" that you can paste into a string of other predictors (possibly using paste-collapse) and convert with as.formula().
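As a rough sketch of that last step, the snippet below builds the same interaction string with base R only (`paste()` instead of `tidyr`, a substitution made here to keep it dependency-free), joins it with the main effects, and fits a vanilla `glm()`. Using `wt` as the response follows the H2O example in this answer.

```r
data(mtcars)

main_effects <- c('mpg', 'cyl', 'hp', 'disp')

# "mpg:cyl+mpg:hp+mpg:disp", built without tidyr
intx_term_string <- paste('mpg', c('cyl', 'hp', 'disp'), sep = ':', collapse = '+')

# Combine main effects and interactions into the right-hand side of the formula
rhs <- paste(c(main_effects, intx_term_string), collapse = '+')
fml <- as.formula(paste('wt ~', rhs))
print(fml)

# Gaussian GLM with only the mpg interactions
fit <- glm(fml, data = mtcars)
print(names(coef(fit)))
```

Because the formula lists the interactions explicitly, the unwanted pairs (cyl:hp, cyl:disp, hp:disp) never enter the model at all.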

Source: https://stackoverflow.com/questions/45426642/h2o-glm-interact-only-certain-predictors

