Merge plm fitted values to dataset

问题

I'm working with a fixed effects regression model using plm.

The model looks like this:

FE.model <-plm(fml, data = data.reg2,
           index=c('Site.ID','date.hour'), # cross section ID and time series ID
           model='within', #coefficients are fixed
           effect='individual')
summary(FE.model)

"fml" is a formula I defined previously. I have many independent variables, so this made it more efficient.

What I want to do is get my fitted values (my yhats) and join them to my base dataset; data.reg2

I was able to get the fitted values using this code:

 Fe.model.fitted <- FE.model$model[[1]] - FE.model$residuals

However, this only gives me a one column vector of fitted values only - I have no way of joining it to my base dataset.

Alternatively, I've tried something like this:

 Fe.model.fitted <- cbind(data.reg2, resid=resid(FE.model), fitted=fitted(FE.model))

However, I get this error with that:

 Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ""pseries"" to a data.frame

Are there any other ways to get my fitted values in my base dataset? Or can someone explain the error I'm getting and maybe a way to fix it?

I should note that I don't want to manually compute the yhats based on my betas. I have way too many independent variables for that option and my defined formula (fml) may change so that option would not be efficient.

Many thanks!!

回答1:

Merging plm fitted values back into the original dataset requires some intermediate steps -- plm drops any rows with missing data, and as far as I can tell, a plm object does not contain the index info. The order of the data is not preserved (see Millo Giovanni's comment in this thread: "the input order is not always preserved").

The steps in short:

Get fitted values from the estimated plm object. It is a single vector but the entries are named. The names correspond to the position in the index.
Get the index, using the index() function. It can return both individual and time indices. Note the index may contain more rows than the fitted values, in case rows were removed for missing data. (It is also possible to generate an index directly from the original data, but I did not see a promise that the original order of the data is preserved in what plm returns.)
Merge into the original data, looking up the id and time values from the index.

Sample code is provided below. Kind of long but I've tried to comment. The code is not optimized, my intention was to list the steps explicitly. Also, I am using data.tables rather than data.frames.

library(data.table); library(plm)

### Generate dummy data. This way we know the "true" coefficients
set.seed(100)
n <- 500 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.table(CJ(id = c("a","b","c","d","e"), time = c(1:(n / 5))))
DT[, x1 := rnorm(n)]
DT[, x2 := rnorm(n)]
DT[, y  := x1 + 2 * x2 + rnorm(n) / 10]

setkey(DT, id, time)
# # Make it an unbalanced panel & put in some NAs
DT <- DT[!(id == "a" & time == 4)]
DT[.("a", 3), x2 := as.numeric(NA)]
DT[.("d", 2), x2 := as.numeric(NA)]

str(DT)

### Run the model -- both individual and time effects; "within" model
summary(PLM <- plm(data = DT, id = c("id", "time"), formula = y ~ x1 + x2, model = "within", effect = "twoways", na.action = "na.omit"))

### Merge the fitted values back into the data.table DT
# Note that PLM$model$y is shorter than the data, i.e. the row(s) with NA have been dropped
cat("\nRows omitted (due to NA): ", nrow(DT) - length(PLM$model$y))

# Since the objects returned by plm() do not contain the index, need to generate it from the data
# The object returned by plm(), i.e. PLM$model$y, has names that point to the place in the index
# Note: The index can also be done as INDEX <- DT[, j = .(id, time)], but use the longer way with index() in case plm does not preserve the order
INDEX <- data.table(index(x = pdata.frame(x = DT, index = c("id", "time")), which = NULL)) # which = NULL extracts both the individual and time indexes
INDEX[, id := as.character(id)]
INDEX[, time := as.integer(time)] # it is returned as a factor, convert back to integer to match the variable type in DT

# Generate the fitted values as the difference between the y values and the residuals
if (all(names(PLM$residuals) == names(PLM$model$y))) { # this should not be needed, but just in case...
    FIT <- data.table(
        index   = as.integer(names(PLM$model$y)), # this index corresponds to the position in the INDEX, from where we get the "id" and "time" below
        fit.plm = as.numeric(PLM$model$y) - as.numeric(PLM$residuals)
    )
}

FIT[, id   := INDEX[index]$id]
FIT[, time := INDEX[index]$time]
# Now FIT has both the id and time variables, can match it back into the original dataset (i.e. we have the missing data accounted for)
DT <- merge(x = DT, y = FIT[, j = .(id, time, fit.plm)], by = c("id", "time"), all = TRUE) # Need all = TRUE, or some data from DT will be dropped!

回答2:

The residuals are deviation of the model from the value on the LHS of the formula .... which you have not shown to us. There is a fitted.panelmodel function in the 'plm' package, but it appears to expect that there will be a fitted value which the plm function does not return by default, nor is it documented to do so, nor is the a way that I see to make it cough one up.

library(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state","year"))
summary(zz)  # the example on the plm page:
> str(fitted(zz))
 NULL
> names(zz$model)
[1] "log(gsp)"  "log(pcap)" "log(pc)"   "log(emp)"  "unemp"    
> Produc[ , c("Yvar", "Fitted")] <- cbind( zz$model[ ,"log(gsp)", drop=FALSE], zz$residuals)
> str(Produc)
'data.frame':   816 obs. of  12 variables:
 $ state : Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year  : int  1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
 $ pcap  : num  15033 15502 15972 16406 16763 ...
 $ hwy   : num  7326 7526 7765 7908 8026 ...
 $ water : num  1656 1721 1765 1742 1735 ...
 $ util  : num  6051 6255 6442 6756 7002 ...
 $ pc    : num  35794 37300 38670 40084 42057 ...
 $ gsp   : int  28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
 $ emp   : num  1010 1022 1072 1136 1170 ...
 $ unemp : num  4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
 $ Yvar  :Classes 'pseries', 'pseries', 'integer'  atomic [1:816] 10.3 10.3 10.4 10.4 10.4 ...
  .. ..- attr(*, "index")='data.frame': 816 obs. of  2 variables:
  .. .. ..$ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. ..$ year : Factor w/ 17 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Fitted: num  -0.04656 -0.03064 -0.01645 -0.00873 -0.02708 ...

回答3:

I have a simplified method. The main problem here is twofold:

1) pdata.frames sort your input alphabetically by name, then year. This can be addressed by sorting your data frame first before running plm.

2) rows with NA in variables included in the formula are dropped. I handle this problem by creating a second formula including my id and time variable, and then use model.frame to extract the data used in the regression (excluding NAs but now also includes id and time)

library(plm)
set.seed(100)
n <- 10 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.frame(id = c("a","c","b","d","e"), time = c(1:(n / 5)),x1 = rnorm(n),x2= rnorm(n),x3=rnorm(n))
DT$Y = DT$x2 + 2 * DT$x3 + rnorm(n) / 10 # make x1 a function of other variables
DT$x3[3]=NA  # add an NA to show this works with missing data 
DT  

# now can add drop.index = F, but note that DT is now sorted by order(id,time)
pdata.frame(DT,index=c('id','time'),drop.index = F)

# order DT to match pdata.frame that will be used for plm
DT=DT[order(DT$id,DT$time),]

# formulas
formulas =Y~x1+x2+x3 
formulas_dataframe = Y~x1+x2+x3 +id+time # add id and time for model.frame

# estimate
random <- plm(formulas, data=DT, index=c("id", "time"), model="random",na.action = 'na.omit')
summary(random) 

# merge prediction and and model.frame 
fitted = data.frame(fitted = random$model[[1]] - random$residuals)
model_data = cbind(as.data.frame(as.matrix(random$model)),fitted)  # this isn't really needed but shows that input and model.frame are same
model_data = cbind(model_data,na.omit(model.frame(formulas_dataframe,DT)))  
model_data

回答4:

I wrote a function (predict.out.plm) to do out of sample predictions after estimating First Differences or Fixed-Effects models with plm.

The function further adds the predicted values to the indices of the original data. This is done by using the rownames saved within the plm - attributes(plmobject)$index and the rownames within the model.matrix

for more details see the function posted here:

https://stackoverflow.com/a/44185441/2409896

来源：https://stackoverflow.com/questions/23143428/merge-plm-fitted-values-to-dataset

标签

plm