I'm running a regression in the form
reg=lm(y ~ x1+x2+x3+z1,data=mydata)
In the place of the last term, z1
, I want to loop through a set of different variables, z1
through z10
, running a regression for each with it as the last term. E.g. in second run I want to use
reg=lm(y ~ x1+x2+x3+z2,data=mydata)
in 3rd run:
reg=lm(y ~ x1+x2+x3+z3,data=mydata)
How can I automate this by looping through the list of z-variables?
With this dummy data:
dat1 <- data.frame(y = rpois(100,5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100)
)
You could get your list of two lm
objects this way:
lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))
Which iterates through those two columns and substitutes them as arguments into the lm
call.
As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.
While what Sam has provided works and is a good solution, I would personally prefer to go about it slightly differently. His answer has already been accepted, so I'm just posting this for the sake of completeness.
dat1 <- data.frame(y = rpois(100, 5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100))
lapply(colnames(dat1)[5:6],
function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
d = dat1)
Rather than looping over the actual columns of the data frame, this loops only over the string of names. This provides some speed improvements as fewer things are copied between iterations.
library(microbenchmark)
microbenchmark({ lapply(<what I wrote above>) })
# Unit: milliseconds
# expr
# {lapply(colnames(dat1)[5:6], function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = "+")), data = d), d = dat1)}
# min lq mean median uq max neval
# 4.014237 4.148117 4.323387 4.220189 4.281995 5.898811 100
microbenchmark({ lapply(<other answer>) })
# Unit: milliseconds
# expr
# {lapply(dat1[, 5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))}
# min lq mean median uq max neval
# 4.391494 4.505056 5.186972 4.598301 4.698818 51.573 100
The difference is fairly small for this toy example, but as the number of observations and predictors increases, the difference will likely become more pronounced.
Depending on what your final goal is, it can be much faster to fit a base model, update it with add1
, and extract the F-test/AIC you want:
> basemodel <- lm(y~x1+x2+x3, dat1)
>
> add1(object=basemodel, grep("z\\d", names(dat1), value=TRUE), test="F")
Single term additions
Model:
y ~ x1 + x2 + x3
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 477.34 164.31
z1 1 0.0768 477.26 166.29 0.0153 0.9019
z2 1 5.1937 472.15 165.21 1.0450 0.3093
See also ?update
for refitting the model.
Here's a different approach using packages from the dplyr / tidyr family. It restructures the data to a long form, then uses group_by()
from the dplyr package instead of lapply()
:
library(dplyr)
library(tidyr)
library(magrittr) # for use_series ()
dat1 %>%
gather(varname, z, z1:z2) %>% # convert data to long form
group_by(varname) %>%
do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
use_series(model)
This converts the data to a long format using gather
, where the z-values occupy the same column. use_series()
from the magrittr package return the list of lm
objects instead of a data.frame
. If you load the broom package, you can extract the model coefficients in this pipeline of code:
library(broom)
dat1 %>%
gather(varname, z, z1:z2) %>%
group_by(varname) %>%
do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
glance(model) # or tidy(model)
Source: local data frame [2 x 12]
Groups: varname
varname r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 z1 0.06606736 0.02674388 2.075924 1.680099 0.1609905 5 -212.3698 436.7396 452.3707 409.3987 95
2 z2 0.06518852 0.02582804 2.076900 1.656192 0.1666479 5 -212.4168 436.8337 452.4647 409.7840 95
Data:
dat1 <- data.frame(y = rpois(100, 5), x1 = runif(100),
x2 = runif(100), x3 = runif(100),
z1 = runif(100), z2 = runif(100))
来源:https://stackoverflow.com/questions/30264666/looping-over-combinations-of-regression-model-terms