Looping over combinations of regression model terms

浪尽此生 提交于 2019-11-28 06:13:58

问题


I'm running a regression in the form

reg=lm(y ~ x1+x2+x3+z1,data=mydata)

In the place of the last term, z1, I want to loop through a set of different variables, z1 through z10, running a regression for each with it as the last term. E.g. in second run I want to use

reg=lm(y ~ x1+x2+x3+z2,data=mydata)

in 3rd run:

reg=lm(y ~ x1+x2+x3+z3,data=mydata)

How can I automate this by looping through the list of z-variables?


回答1:


With this dummy data:

dat1 <- data.frame(y = rpois(100,5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100)
)

You could get your list of two lm objects this way:

 lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))

Which iterates through those two columns and substitutes them as arguments into the lm call.

As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.




回答2:


While what Sam has provided works and is a good solution, I would personally prefer to go about it slightly differently. His answer has already been accepted, so I'm just posting this for the sake of completeness.

dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))

lapply(colnames(dat1)[5:6],
       function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
       d = dat1)

Rather than looping over the actual columns of the data frame, this loops only over the string of names. This provides some speed improvements as fewer things are copied between iterations.

library(microbenchmark)

microbenchmark({ lapply(<what I wrote above>) })
# Unit: milliseconds
# expr
# {lapply(colnames(dat1)[5:6], function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = "+")), data = d), d = dat1)}
#       min       lq     mean   median       uq      max neval
#  4.014237 4.148117 4.323387 4.220189 4.281995 5.898811   100

microbenchmark({ lapply(<other answer>) })
# Unit: milliseconds
# expr
# {lapply(dat1[, 5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))}
#       min       lq     mean   median       uq    max neval
#  4.391494 4.505056 5.186972 4.598301 4.698818 51.573   100

The difference is fairly small for this toy example, but as the number of observations and predictors increases, the difference will likely become more pronounced.




回答3:


Depending on what your final goal is, it can be much faster to fit a base model, update it with add1, and extract the F-test/AIC you want:

> basemodel <- lm(y~x1+x2+x3, dat1)
> 
> add1(object=basemodel, grep("z\\d", names(dat1), value=TRUE), test="F")
Single term additions

Model:
y ~ x1 + x2 + x3
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              477.34 164.31               
z1      1    0.0768 477.26 166.29  0.0153 0.9019
z2      1    5.1937 472.15 165.21  1.0450 0.3093

See also ?update for refitting the model.




回答4:


Here's a different approach using packages from the dplyr / tidyr family. It restructures the data to a long form, then uses group_by() from the dplyr package instead of lapply():

library(dplyr)
library(tidyr)
library(magrittr) # for use_series ()
dat1 %>%
  gather(varname, z, z1:z2) %>% # convert data to long form
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  use_series(model)

This converts the data to a long format using gather, where the z-values occupy the same column. use_series() from the magrittr package return the list of lm objects instead of a data.frame. If you load the broom package, you can extract the model coefficients in this pipeline of code:

library(broom)
dat1 %>%
  gather(varname, z, z1:z2) %>%
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  glance(model) # or tidy(model)

Source: local data frame [2 x 12]
Groups: varname

  varname  r.squared adj.r.squared    sigma statistic   p.value df    logLik      AIC      BIC deviance df.residual
1      z1 0.06606736    0.02674388 2.075924  1.680099 0.1609905  5 -212.3698 436.7396 452.3707 409.3987          95
2      z2 0.06518852    0.02582804 2.076900  1.656192 0.1666479  5 -212.4168 436.8337 452.4647 409.7840          95

Data:

dat1 <- data.frame(y = rpois(100, 5), x1 = runif(100),
                   x2 = runif(100), x3 = runif(100),
                   z1 = runif(100), z2 = runif(100))


来源:https://stackoverflow.com/questions/30264666/looping-over-combinations-of-regression-model-terms

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!