Residuals from first differenced regression on unbalanced panel

Submitted by 别来无恙 on 2020-01-01 18:55:13

Question


I am trying to use plm to estimate a first differenced model on some unbalanced panel data. My model seems to work and I get coefficient estimates, but I want to know if there is a way to get the residual (or fitted value) per observation used.

I have run into two problems: I don't know how to attach residuals to the observations they belong to, and I seem to get an incorrect number of residuals.

If I retrieve the residuals from the estimated model using model.name$residuals, I get a vector that is shorter than model.name$model.

library(plm)
set.seed(1)  # fix the seed so the random regressor is reproducible
X <- rnorm(14)
Y <- c(.4,1,1.5,1.3,1,4,5,6.5,7.3,3.7,5,.7,4,6)
# IDs 1 and 2 are observed for t = 1..5; ID 3 is missing t = 3
Time <- c(rep(1:5, times = 2), c(1, 2, 4, 5))
ID <- c(rep(1:2, each = 5), c(3, 3, 3, 3))
TestData <- data.frame(Y = Y, X = X, ID = ID, Time = Time)
model.name <- plm(Y ~ X, data = TestData, index = c("ID", "Time"), model = "fd")

> length(model.name$residuals)
[1] 11
> nrow(model.name$model)
[1] 14

(Note: ID=3 is missing an observation for t=3)

Looking at model.name$model I see it includes all observations, including t=1 for each member of ID. In the first differencing the t=1 observations will be removed, so in this case both IDs with all time periods should have 4 residuals from the remaining time periods. ID=3 should have a residual for t=2, none for t=3 as it is missing, none for t=4 as there is no value to difference (due to the missing t=3 value) and then a residual for t=5.

From this it seems that there should be 10 residuals, but I have 11. I would appreciate any help with why there are this many residuals, and how to connect residuals to the correct index (ID and Time).


Answer 1:


The lagging done with model="fd" is based on the neighbouring rows, not on the actual value of the time index. With non-consecutive time periods this gives unexpected results: for ID=3, the t=4 row is differenced against the t=2 row, which is where the 11th residual comes from. To avoid this, do the differencing yourself while respecting the time period when lagging, and then estimate a pooling model on the differenced data. The unbalancedness of the data is not a concern here.
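To make the two counting rules concrete, here is a minimal base R sketch (it does not use plm, and the variable names n_row_based and n_time_based are only illustrative). It counts how many differences each rule produces for the panel from the question, assuming the rows are sorted by ID and Time:

```r
# Panel structure from the question (X is irrelevant for counting, so it is omitted)
TestData <- data.frame(
  ID   = c(rep(1:2, each = 5), rep(3, 4)),
  Time = c(rep(1:5, times = 2), 1, 2, 4, 5),
  Y    = c(.4, 1, 1.5, 1.3, 1, 4, 5, 6.5, 7.3, 3.7, 5, .7, 4, 6)
)
TestData <- TestData[order(TestData$ID, TestData$Time), ]

# Row-based rule: every row except the first row of each ID gets a difference
n_row_based <- sum(duplicated(TestData$ID))

# Time-based rule: a difference exists only if (ID, Time - 1) is also observed
key      <- paste(TestData$ID, TestData$Time)
prev_key <- paste(TestData$ID, TestData$Time - 1)
n_time_based <- sum(prev_key %in% key)

n_row_based   # 11
n_time_based  # 10
```

The row-based count matches the 11 residuals reported in the question, and the time-based count matches the 10 that were expected.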

Since version 1.7.0 of package plm, the lag() function performs lagging based on the value of the time period by default (the previous default was neighbouring rows). Use this function to do the lagging yourself.

Continuing your example:

pTestData <- pdata.frame(TestData, index=c("ID", "Time"))

# First difference: current value minus the value from the previous time period
pTestData$Y_diff <- pTestData$Y - plm::lag(pTestData$Y)
pTestData$X_diff <- pTestData$X - plm::lag(pTestData$X)
fdmod <- plm(Y_diff ~ X_diff, data = pTestData, model = "pooling")
length(residuals(fdmod)) # 10
nrow(fdmod$model)        # 10

I explicitly used plm:: when referring to the lag function, as several other packages have a lag function as well (most notably stats and dplyr) and you want the one from package plm here. To attach the residuals to the differenced data (the data actually used to estimate the model), do something like: dat <- cbind(fdmod$model, residuals(fdmod))
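If you want ID and Time as explicit columns next to each residual, one option is to do the time-aware differencing with a self-join. The following is only a base R sketch, not plm's own mechanism: it assumes that pooled OLS via lm() is an acceptable stand-in for the pooling model above, and the names lagged, diffed and fit are made up for illustration.

```r
set.seed(1)  # assumption: fixed seed so the random regressor is reproducible
TestData <- data.frame(
  ID   = c(rep(1:2, each = 5), rep(3, 4)),
  Time = c(rep(1:5, times = 2), 1, 2, 4, 5),
  Y    = c(.4, 1, 1.5, 1.3, 1, 4, 5, 6.5, 7.3, 3.7, 5, .7, 4, 6),
  X    = rnorm(14)
)

# Lagged copy: shifting Time forward by one makes the row for period t - 1
# line up with the row for period t in the merge below
lagged <- data.frame(
  ID    = TestData$ID,
  Time  = TestData$Time + 1,
  Y_lag = TestData$Y,
  X_lag = TestData$X
)

# The inner join keeps only observations whose previous period is observed
diffed <- merge(TestData, lagged, by = c("ID", "Time"))
diffed$Y_diff <- diffed$Y - diffed$Y_lag
diffed$X_diff <- diffed$X - diffed$X_lag

# Pooled OLS on the differenced data
fit <- lm(Y_diff ~ X_diff, data = diffed)
diffed$residual <- as.numeric(residuals(fit))
diffed$fitted   <- as.numeric(fitted(fit))

# One row per observation used, labelled by ID and Time
diffed[, c("ID", "Time", "Y_diff", "X_diff", "residual", "fitted")]
```

Because the join is done on ID and Time, ID=3 contributes rows only for t=2 and t=5, so the result has 10 rows.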

Also, you might be interested in the function is.pconsecutive, which checks whether each individual's time periods are consecutive:

is.pconsecutive(pTestData)
#    1     2     3 
# TRUE  TRUE FALSE 

Function make.pconsecutive will make your data consecutive by inserting rows with NA values for the missing period.
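As a rough usage sketch of make.pconsecutive on the example data (assuming a plm version that provides make.pconsecutive and is.pconsecutive; the name pCons is just illustrative):

```r
library(plm)

# Panel from the question; the regressor X is left out because only the
# time structure matters here
TestData <- data.frame(
  ID   = c(rep(1:2, each = 5), rep(3, 4)),
  Time = c(rep(1:5, times = 2), 1, 2, 4, 5),
  Y    = c(.4, 1, 1.5, 1.3, 1, 4, 5, 6.5, 7.3, 3.7, 5, .7, 4, 6)
)
pTestData <- pdata.frame(TestData, index = c("ID", "Time"))

# Insert an NA row for the missing period (ID 3, t = 3)
pCons <- make.pconsecutive(pTestData)

nrow(pTestData)         # rows before filling the gap
nrow(pCons)             # rows after filling the gap
is.pconsecutive(pCons)  # check each individual again
```

The inserted row holds NA for Y, so it marks the gap without adding an observation to any estimation.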



Source: https://stackoverflow.com/questions/39364471/residuals-from-first-differenced-regression-on-unbalanced-panel
