Lagged regression in R: determining the optimal lag

Question

I have a variable that is believed to be a good predictor for another variable, but with some lag. I don't know what the lag is and want to estimate it from the data.

Here is an example:

library(tidyverse)

data <- tibble(
  id = 1:100,
  y = dnorm(1:100, 30, 20) * 1000,
  x.shifted = y / 10 + runif(100) / 10,
  x.actual = lag(x.shifted, 30)
)

data %>% 
  ggplot(aes(id, x.shifted)) +
  geom_point() +
  geom_point(aes(id, x.actual), color = 'blue') +
  geom_point(aes(id, y), color = 'red')

The model lm(y ~ x.actual, data) would not be a great fit, but the model lm(y ~ x.shifted, data) would be. Here, I know that x must be shifted by -30 days, but imagine I did not, and all I knew was that the shift is somewhere between -30 and +30.

The immediate approach that comes to mind is to run 61 regression models, from one that shifts x by -30 to one that shifts it by +30, and then pick the model with the best AIC or BIC. However, (a) is this the correct approach, and (b) are there R packages that already do this and find the optimal lag?

Answer 1:

What you are describing is the cross-correlation of the two variables. You can do this very easily in R with ccf.
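As a minimal sketch of the ccf route, assuming the data from the question (regenerated here with a seed, which the question does not set): ccf rejects missing values by default, so the 30 leading rows where x.actual is NA are dropped first. Per the ccf documentation, the value at lag k estimates the correlation between x[t + k] and y[t]. Note that ccf centres both series on their overall means and divides by the full series length, so estimates at large lags are shrunk toward zero and the peak need not coincide exactly with a pairwise correlation scan. The names obs, cc_tbl and ccf_best are illustrative only.

```r
library(dplyr)

set.seed(1)  # assumption: seed added for reproducibility; the question uses none
data <- tibble(
  id = 1:100,
  y = dnorm(1:100, 30, 20) * 1000,
  x.shifted = y / 10 + runif(100) / 10,
  x.actual = lag(x.shifted, 30)
)

# ccf() does not accept NAs by default, so keep only rows where x.actual exists.
obs <- filter(data, !is.na(x.actual))

# Sample cross-correlation for lags -40..40, without drawing the plot.
cc <- ccf(obs$x.actual, obs$y, lag.max = 40, plot = FALSE)

# Tabulate the estimates and pick the lag with the largest correlation.
cc_tbl <- tibble(lag = as.vector(cc$lag), r = as.vector(cc$acf))
ccf_best <- cc_tbl$lag[which.max(cc_tbl$r)]
ccf_best
```

Calling ccf with plot = TRUE (the default) draws the full correlogram instead, which is often the quicker way to see whether there is one clear peak or a broad plateau.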

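However, to get just the optimal lag, we can simplify to a one-liner: sapply feeds each candidate lag into the cor function, and which.max returns the position of the highest correlation. Because the candidates are 1:50, that position is the lag itself. Note that lag here is dplyr::lag (loaded with the tidyverse), not stats::lag: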

which.max(sapply(1:50, function(i) cor(data$x.actual, lag(data$y, i), use = "complete")))
#> [1] 30
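The one-liner only tries positive lags. A sketch of the same scan over the full -30 to +30 range from the question could look like the following, assuming the question's data regenerated with a seed (not in the original). shift_by is a hypothetical helper, not part of dplyr; it wraps dplyr::lag and dplyr::lead. The n column records how many complete pairs each lag leaves, which matters for the AIC/BIC idea in the question: models fitted at different lags use different numbers of observations, so their AIC values are not directly comparable. For a single-predictor lm, R-squared is the squared correlation, so a scan of correlations carries the same information as a scan of simple regressions.

```r
library(dplyr)

set.seed(1)  # assumption: seed added for reproducibility; the question uses none
data <- tibble(
  id = 1:100,
  y = dnorm(1:100, 30, 20) * 1000,
  x.shifted = y / 10 + runif(100) / 10,
  x.actual = lag(x.shifted, 30)
)

# Hypothetical helper: shift v forwards (k > 0) or backwards (k < 0) by |k| steps.
shift_by <- function(v, k) {
  if (k == 0) return(v)
  if (k > 0) dplyr::lag(v, k) else dplyr::lead(v, -k)
}

lags <- -30:30
scan <- tibble(
  lag = lags,
  # Correlation between x.actual and y shifted by each candidate lag.
  r = sapply(lags, function(k) {
    cor(data$x.actual, shift_by(data$y, k), use = "complete.obs")
  }),
  # Number of complete (non-NA) pairs behind each correlation.
  n = sapply(lags, function(k) {
    sum(complete.cases(data$x.actual, shift_by(data$y, k)))
  })
)

# Lag with the highest correlation.
best <- scan$lag[which.max(scan$r)]
best
```

Inspecting scan as a whole, rather than only best, shows how sharply the correlation falls away from its peak and how few pairs remain at the extreme lags.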

Source: https://stackoverflow.com/questions/62199626/lagged-regression-in-r-determining-the-optimal-lag

Tags

linear-regression