Rolling regression and prediction with lm() and predict()

I need to apply lm() to an enlarging subset of my dataframe dat, while making prediction for the next observation. For example, I am doing:

fit model      predict
----------     -------
dat[1:3, ]     dat[4, ]
dat[1:4, ]     dat[5, ]
    .             .
    .             .
dat[-1, ]      dat[nrow(dat), ]

I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do

dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]

fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)

How can I do this automatically for all subsets, and potentially extract what I want into a table?

From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.

Thanks.

(Efficient) solution

This is what you can do:

p <- 3  ## number of parameters in lm()
n <- nrow(dat) - 1

## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
  }

## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")

Note I have done several things inside the bundle function:

I have used subset argument for selecting a subset to fit
I have used model = FALSE to not save model frame hence we save workspace

Overall, there is no obvious loop, but sapply is used.

Fitting starts from p, the minimum number of data required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we at least need the final column for prediction.

Test

Example data (with 30 "observations")

dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))

Applying code above gives results (27 rows in total, truncated output for 5 rows)

            adj.r2 prediction        se
 [1,]          NaN   3.881068       NaN
 [2,]  0.106592619   3.676821 0.7517040
 [3,]  0.545993989   3.892931 0.2758347
 [4,]  0.622612495   3.766101 0.1508270
 [5,]  0.180462206   3.996344 0.2059014

The first column is the adjusted-R.squared value for fitted model, while the second column is the prediction. The first value for adj.r2 is NaN, because the first model we fit has 3 coefficients for 3 data points, hence no sensible statistics is available. The same happens to se as well, as the fitted line has no 0 residuals, so prediction is done without uncertainty.

I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).

(Efficient) Solution

data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))

data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]

for(i in 3:nrow(data)){
  nam  <- paste("predict", i, sep = "")
  nam1 <- paste("fit", i, sep = "")
  nam2 <- paste("summary_fit", i, sep = "")

  fit = lm(clicks ~ v1 + v2, data=data[1:i,])
  tmp  <- predict(fit, newdata=data2, se.fit=TRUE)
  tmp1 <- fit
  tmp2 <- summary(fit)
  assign(nam, tmp)
  assign(nam1, tmp1)
  assign(nam2, tmp2)
}

All of the results you want will be stored in the data objects this creates.

For example:

> summary_fit10$r.squared
[1] 0.3087432

You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output files like this:

rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()

frames.fit     <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum     <- frames[197:294]

fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
  tmp <- get(frames.fit[i])
  fit.table              <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}

fit.table

> fit.table
             intercept                   v1                   v2 sourcedf
2  -0.0647017971121678     1.34929652763687   -0.300502017324518    fit10
3  -0.0401617893034109   -0.034750571912636  -0.0843076273486442   fit100
4   0.0132968863522573     1.31283604433593   -0.388846211083564    fit11
5   0.0315113918953643     1.31099122173898   -0.371130010135382    fit12
6    0.149582794027583    0.958692838785998   -0.299479715938493    fit13
7  0.00759688947362175    0.703525856001948   -0.297223988673322    fit14
8    0.219756240025917    0.631961979610744   -0.347851129205841    fit15
9     0.13389223748979    0.560583832333355   -0.276076134872669    fit16
10   0.147258022154645    0.581865844000838   -0.278212722024832    fit17
11  0.0592160359650468    0.469842498721747   -0.163187274356457    fit18
12   0.120640756525163    0.430051839741539   -0.201725012088506    fit19
13   0.101443924785995     0.34966728554219   -0.231560038360121    fit20
14  0.0416637001406594    0.472156988919337   -0.247684504074867    fit21
15 -0.0158319749710781    0.451944113682333   -0.171367482879835    fit22
16 -0.0337969739950376    0.423851304105399   -0.157905431162024    fit23
17  -0.109460218252207     0.32206642419212   -0.055331391802687    fit24
18  -0.100560410735971    0.335862465403716  -0.0609509815266072    fit25
19  -0.138175283219818    0.390418411384468  -0.0873106257144312    fit26
20  -0.106984355317733    0.391270279253722  -0.0560299858019556    fit27
21 -0.0740684978271464    0.385267011513678  -0.0548056844433894    fit28

来源：https://stackoverflow.com/questions/38041167/rolling-regression-and-prediction-with-lm-and-predict

标签

regression

linear-regression

predict