H2O GLM: interact only certain predictors

大城市里の小女人 submitted on 2019-12-05 20:06:22

According to ?h2o.glm the interactions= parameter takes:

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

You do not want all pairwise combinations, only specific ones.

Unfortunately, the R H2O API does not provide a formula interface. If it did, an arbitrary set of interactions could be specified programmatically, as in a vanilla R glm (see footnote 1).

Option 1: Use beta_constraints

One solution is to include all pairwise combinations in the model and then suppress those you do not want by setting the betas equal to 0.

According to the glm docs, beta_constraints= serves to:

Specify a dataset to use beta constraints. The selected frame is used to constraint the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

According to the H2O Glossary, the format for beta_constraints is:

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds of beta, and "beta_given" is some supplied starting values for beta.

Now we know how to fill out our beta_constraints data frame except for how to format the interaction term names. The doc on interactions doesn't tell us what to expect. So let's just run an example with interactions through H2O and see what the interactions get named.

library('h2o')
remoteH2O <- h2o.init(ip='xxx.xx.xx.xxx', startH2O=FALSE)

data(mtcars)

df1 <- as.h2o(mtcars, destination_frame = 'demo_mtcars')

target <- 'wt'
predictors <- c('mpg','cyl','hp','disp')

glm1 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0, # disable regularization, but your use case may vary
                standardize = FALSE, # we want to see the raw parameters, but your use case may vary
                interactions = predictors # create all interactions
                )
print(glm1) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     4.336269
# 2    mpg_cyl     0.019558
# 3     mpg_hp     0.000156
# ..

So it looks like the interaction terms are getting named like v1_v2.

So let's name all the interaction terms we want to suppress, using setdiff() against the terms we want to keep.

library(tidyr)
intx_terms_keep <- # see footnote 1 for explanation
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep='_') %>% unlist()

intx_terms_suppress <- setdiff( # suppress all interactions minus those we wish to keep
                             combn(predictors,2,FUN=paste,collapse='_'), 
                             intx_terms_keep
                            )
constraints <- data.frame(names=intx_terms_suppress, 
                          lower_bounds=0, 
                          upper_bounds=0, 
                          beta_given=0)

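glm2 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm2', # distinct id so the first model is not overwritten on the cluster
                lambda = 0,
                standardize = FALSE,
                interactions = predictors, # create all interactions
                beta_constraints = constraints # pin the unwanted interaction betas to 0
)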
print(glm2) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     3.405154
# 2    mpg_cyl    -0.012740
# 3     mpg_hp    -0.000250
# 4   mpg_disp     0.000066
# 5     cyl_hp     0.000000
# 6   cyl_disp     0.000000
# 7    hp_disp     0.000000
# 8        mpg    -0.018981
# 9        cyl     0.168820
# 10      disp     0.004070
# 11        hp     0.000501

As you can see, only the desired interaction terms have non-zero coefficients. The rest are effectively ignored. However, since they are still terms in the model, they may count towards degrees of freedom and may affect some of the metrics (e.g., adjusted R-squared).

Option 2: pre-create the interaction terms

As @Darren Cook mentioned, another solution would be to pre-create the interactions as variables in the training dataset.

This approach would ensure that the unwanted interactions never enter the model, so they neither count towards degrees of freedom nor affect your adjusted R-squared.
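A minimal sketch of Option 2, assuming all predictors are numeric (as they are in mtcars), so that an interaction is simply the product of the two columns. The `mpg_cyl`-style column names mirror the naming H2O showed above but are otherwise arbitrary, and the frame name `demo_mtcars_intx` in the commented H2O calls is a made-up placeholder. The data preparation is plain base R; the H2O calls are left as comments because they need a running cluster.

```r
data(mtcars)

# Interactions to keep: mpg with each of cyl, hp and disp
keep <- list(c('mpg', 'cyl'), c('mpg', 'hp'), c('mpg', 'disp'))

mt <- mtcars
for (pair in keep) {
  # numeric x numeric interaction = element-wise product of the two columns
  mt[[paste(pair, collapse = '_')]] <- mt[[pair[1]]] * mt[[pair[2]]]
}

# Names of the new columns, e.g. "mpg_cyl"
intx_cols <- vapply(keep, paste, character(1), collapse = '_')
predictors_all <- c('mpg', 'cyl', 'hp', 'disp', intx_cols)

# With a running H2O cluster (not run here), fit WITHOUT interactions=:
# df2  <- as.h2o(mt, destination_frame = 'demo_mtcars_intx')  # hypothetical frame name
# glm3 <- h2o.glm(x = predictors_all, y = 'wt', training_frame = df2,
#                 lambda = 0, standardize = FALSE)
```

Because the interaction columns are ordinary predictors here, no beta_constraints frame is needed. Categorical predictors would need a different encoding and are not covered by this sketch.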

Footnote 1: Alternative, non-H2O solution for the vanilla glm formula interface

In a vanilla R glm(), which allows the formula interface, I would use expand.grid to create a string of interaction terms and include it in the formula.

Pass expand.grid two vectors -- you want to interact all terms in v1 with all terms in v2.

To use your example, you want to interact mpg with cyl, hp, and disp:

library(tidyr)
intx_term_string <- 
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep=':') %>% apply(2, paste, collapse='+')

This gives you a string of interaction terms like "mpg:cyl+mpg:hp+mpg:disp" that you can paste into a string of other predictors (possibly using paste-collapse) and convert with as.formula().
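For completeness, here is one way the last step might look end to end. This is only a sketch: it builds the same interaction string with base `paste()` instead of tidyr (a substitution, not the approach shown above), and the variable names `main_terms`, `intx_terms`, `f` and `fit` are mine.

```r
data(mtcars)

main_terms <- c('mpg', 'cyl', 'hp', 'disp')

# Same string as above, built with base R: "mpg:cyl+mpg:hp+mpg:disp"
intx_terms <- paste('mpg', c('cyl', 'hp', 'disp'), sep = ':', collapse = '+')

# Paste main effects and interactions into one right-hand side, then convert
rhs <- paste(c(main_terms, intx_terms), collapse = ' + ')
f <- as.formula(paste('wt ~', rhs))

# Vanilla glm (gaussian family by default) with only the chosen interactions
fit <- glm(f, data = mtcars)
print(names(coef(fit)))
```

Only the three mpg interactions appear among the coefficient names; pairs such as cyl:hp are never created, so nothing has to be constrained to zero afterwards.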
