Question
This is a very simple question, but I haven't been able to find a definitive answer, so I thought I would ask it. I use the plm package for dealing with panel data. I am attempting to use the lag function to lag a variable FORWARD in time (the default is to retrieve the value from the previous period, and I want the value from the NEXT). I found a number of old articles/questions (circa 2009) suggesting that this is possible by using k=-1 as an argument. However, when I attempt this, I get an error.
Sample code:
library(plm)
df<-as.data.frame(matrix(c(1,1,1,2,2,3,20101231,20111231,20121231,20111231,20121231,20121231,50,60,70,120,130,210),nrow=6,ncol=3))
names(df)<-c("individual","date","data")
df$date<-as.Date(as.character(df$date),format="%Y%m%d")
df.plm<-pdata.frame(df,index=c("individual","date"))
Lagging:
lag(df.plm$data,0)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
50 60 70 120 130 210
lag(df.plm$data,1)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
NA 50 60 NA 120 NA
lag(df.plm$data,-1)
##returns
Error in rep(1, ak) : invalid 'times' argument
I've also read that plm.data has replaced pdata.frame for some applications in plm. However, plm.data doesn't seem to work with the lag function at all:
df.plm<-plm.data(df,indexes=c("individual","date"))
lag(df.plm$data,1)
##returns
[1] 50 60 70 120 130 210
attr(,"tsp")
[1] 0 5 1
I would appreciate any help. If anyone has another suggestion for a package to use for lagging, I'm all ears. However, I do love plm because it automagically deals with lagging across multiple individuals and skips gaps in the time series.
Answer 1:
EDIT2: lagging forward (= leading values) is implemented in plm CRAN releases >= 1.6-4.
The functions are either lead() or lag() (the latter with a negative integer for leading values).
Take care of any other attached packages that use the same function names. To be sure, you can refer to the function by its full namespace, e.g., plm::lead.
Examples from ?plm::lead:
# First, create a pdata.frame
data("EmplUK", package = "plm")
Em <- pdata.frame(EmplUK)
# Then extract a series, which becomes additionally a pseries
z <- Em$output
class(z)
# compute negative lags (= leading values)
lag(z, -1)
lead(z, 1) # same as line above
identical(lead(z, 1), lag(z, -1)) # TRUE
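As a rough way to act on the warning about clashing function names, the sketch below uses only base R to show which attached packages define a function called lag and which one a bare lag() call would currently resolve to. It assumes a typical clash such as dplyr, which exports its own lag() and lead(); that package is not mentioned in the original answer and is named here only as an illustration.

```r
# List every attached package that defines an object named "lag",
# in search-path order (the first entry is the one a bare lag() call uses).
# In a fresh session this is just "package:stats"; attaching a package that
# exports its own lag() (dplyr is assumed here as an example) adds an entry
# in front of it.
find("lag")

# Name of the namespace that the currently visible lag() belongs to.
environmentName(environment(lag))
```

If the first entry is not the one you expect, calling the function with an explicit namespace, as in plm::lead(z, 1), sidesteps the ambiguity.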
Answer 2:
I had this same problem and couldn't find a good solution in plm or any other package. ddply was tempting (e.g. s5 = ddply(df, .(country,year), transform, lag=lag(df[, "value-to-lag"], lag=3))), but I couldn't get the NAs in my lagged column to line up properly for lags other than one.
I wrote a brute force solution that iterates over the dataframe row-by-row and populates the lagged column with the appropriate value. It's horrendously slow (437.33s for my 13000x130 dataframe vs. 0.012s for turning it into a pdata.frame and using lag) but it got the job done for me. I thought I would share it here because I couldn't find much information elsewhere on the internet.
In the function below:
df is your dataframe. The function returns df with a new column containing the forward values.
group is the column name of the grouping variable for your panel data. For example, I had longitudinal data on multiple countries, and I used "Country.Name" here.
x is the column you want to generate lagged values from, e.g. "GDP".
forwardx is the (new) column that will contain the forward lags, e.g. "GDP.next.year".
lag is the number of periods into the future. For example, if your data were taken in annual intervals, using lag=5 would set forwardx to the value of x five years later.
add_forward_lag <- function(df, group, x, forwardx, lag) {
  n <- nrow(df)
  # rows that have a candidate observation `lag` rows further down
  for (i in seq_len(max(n - lag, 0))) {
    if (as.character(df[i, group]) == as.character(df[i + lag, group])) {
      # put forward observation in forwardx
      df[i, forwardx] <- df[i + lag, x]
    } else {
      # end of group, no forward observation
      df[i, forwardx] <- NA
    }
  }
  # last elem(s) in forwardx are NA (seq_len avoids a bad range when lag is 0)
  for (j in seq_len(min(lag, n))) {
    df[n - j + 1, forwardx] <- NA
  }
  return(df)
}
See sample output using the built-in DNase dataset. This doesn't make sense in the context of the dataset, but it lets you see what the columns do.
data(DNase)  # DNase is a dataset in R's datasets package, not a package itself
add_forward_lag(DNase, "Run", "density", "lagged_density",3)
Grouped Data: density ~ conc | Run
Run conc density lagged_density
1 1 0.04882812 0.017 0.124
2 1 0.04882812 0.018 0.206
3 1 0.19531250 0.121 0.215
4 1 0.19531250 0.124 0.377
5 1 0.39062500 0.206 0.374
6 1 0.39062500 0.215 0.614
7 1 0.78125000 0.377 0.609
8 1 0.78125000 0.374 1.019
9 1 1.56250000 0.614 1.001
10 1 1.56250000 0.609 1.334
11 1 3.12500000 1.019 1.364
12 1 3.12500000 1.001 1.730
13 1 6.25000000 1.334 1.710
14 1 6.25000000 1.364 NA
15 1 12.50000000 1.730 NA
16 1 12.50000000 1.710 NA
17 2 0.04882812 0.045 0.123
18 2 0.04882812 0.050 0.225
19 2 0.19531250 0.137 0.207
Given how long this takes, you may want to use a different approach: backwards-lag all of your other variables.
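If the row-by-row loop is too slow, the same forward shift can be sketched in vectorized base R with ave(), which applies a function within each group. This is a minimal sketch, not a tested replacement: the name add_forward_lag_vec is made up for illustration, and like the loop above it assumes the rows are already sorted by time within each group and that there are no gaps in the time series.

```r
# Hypothetical vectorized variant of add_forward_lag (same arguments).
# Assumes rows are sorted by time within each group and have no time gaps.
add_forward_lag_vec <- function(df, group, x, forwardx, lag) {
  df[[forwardx]] <- ave(df[[x]], df[[group]], FUN = function(v) {
    # indexing past the end of an atomic vector yields NA,
    # so the last `lag` entries of each group become NA
    v[seq_along(v) + lag]
  })
  df
}

# Small panel with the same values as the data in the question
panel <- data.frame(
  individual = c(1, 1, 1, 2, 2, 3),
  data       = c(50, 60, 70, 120, 130, 210)
)

res <- add_forward_lag_vec(panel, "individual", "data", "data_next", 1)
res$data_next
# 60 70 NA 130 NA NA
```

Because the shift is done per group, values never leak from one individual into the next, which is the same guarantee the explicit group comparison in the loop provides.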
Source: https://stackoverflow.com/questions/13037389/lagging-forward-in-plm