Question
I have a variable that is believed to be a good predictor for another variable, but with some lag. I don't know what the lag is and want to estimate it from the data.
Here is an example:
library(tidyverse)
data <- tibble(
  id = 1:100,
  y = dnorm(1:100, 30, 20) * 1000,
  x.shifted = y / 10 + runif(100) / 10,
  x.actual = lag(x.shifted, 30)
)
data %>%
  ggplot(aes(id, x.shifted)) +
  geom_point() +
  geom_point(aes(id, x.actual), color = 'blue') +
  geom_point(aes(id, y), color = 'red')
The model lm(y ~ x.actual, data) would not be a great fit, but the model lm(y ~ x.shifted, data) would be. Here, I know that x must be shifted by -30 days, but imagine I did not and all I knew was that it is between -30 and +30.
The immediate approach that comes to mind is to run 61 regression models, from one that shifts x by -30 to the one that shifts it by +30, and then pick the model with the best AIC or BIC. However, (a) is this the correct approach, and (b) are there R packages that already do this and find the optimal lag?
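To make the brute-force idea concrete, here is a minimal sketch of it, not a definitive implementation. It rebuilds the example in base R so it runs without tidyverse. The seed, the helper shift_by() and the names x_actual and x_shifted are assumptions added for illustration. It ranks the 61 models by R-squared rather than AIC: each shift leaves a different number of non-missing rows, and AIC values are only comparable between models fitted to the same observations.

```r
# Rebuild the example with base R so the sketch runs without tidyverse.
set.seed(42)  # assumption: seed added only to make the run repeatable
n <- 100
y <- dnorm(1:n, 30, 20) * 1000
x_shifted <- y / 10 + runif(n) / 10
x_actual <- c(rep(NA, 30), x_shifted[1:(n - 30)])  # same as lag(x.shifted, 30)

# Hypothetical helper: k > 0 lags v by k rows, k < 0 leads it by |k| rows.
shift_by <- function(v, k) {
  m <- length(v)
  if (k >= 0) c(rep(NA, k), v[seq_len(m - k)])
  else c(v[(1 - k):m], rep(NA, -k))
}

shifts <- -30:30
# One regression per candidate shift; lm() drops rows with NA by default.
r2 <- sapply(shifts, function(k) {
  summary(lm(y ~ shift_by(x_actual, k)))$r.squared
})

best_shift <- shifts[which.max(r2)]
best_shift
```

With one predictor, R-squared is the squared correlation between y and the shifted x, so this ranking is equivalent to picking the shift with the largest absolute correlation.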
Answer 1:
What you are describing is the cross-correlation of the two variables. You can do this very easily in R with ccf.
However, to just get the optimum lag, we can simplify to a one-liner by using sapply to feed each candidate lag into the cor function, then use which.max to find the highest correlation:
which.max(sapply(1:50, function(i) cor(data$x.actual, lag(data$y, i), use = "complete")))
#> [1] 30
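The one-liner above only scans positive lags 1 to 50, and it relies on lag() being the dplyr version loaded by tidyverse. As a rough sketch of the same idea for shifts in both directions, the following uses base R only. The seed, the search range of -40 to 40 and the variable names are assumptions made for illustration. The range is a little wider than the ±30 in the question so that the peak does not sit on the edge of the search window.

```r
# Same simulated data as in the question, rebuilt without tidyverse.
set.seed(42)  # assumption: seed added only to make the run repeatable
y <- dnorm(1:100, 30, 20) * 1000
x_shifted <- y / 10 + runif(100) / 10
x_actual <- c(rep(NA, 30), x_shifted[1:70])  # same as lag(x.shifted, 30)

lags <- -40:40
cors <- sapply(lags, function(k) {
  # k >= 0 pairs x_actual[t] with y[t - k]; k < 0 pairs it with y[t + |k|].
  y_k <- if (k >= 0) {
    c(rep(NA, k), y[seq_len(100 - k)])
  } else {
    c(y[(1 - k):100], rep(NA, -k))
  }
  # Pearson correlation over the rows where both series are observed.
  cor(x_actual, y_k, use = "complete.obs")
})

best_lag <- lags[which.max(cors)]
best_lag
```

Here a positive value means y has to be lagged to line up with x.actual, which is the same as shifting x by the negative of that value. If negative relationships are possible, taking which.max(abs(cors)) instead would pick the strongest correlation regardless of sign.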
Source: https://stackoverflow.com/questions/62199626/lagged-regression-in-r-determining-the-optimal-lag