How to efficiently extrapolate missing data for multiple variables

问题

I have panel data and numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate for missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA values in the "National GDP" and "National Life Expectancy" variables using a linear trend calculated with the observed data points in each column.

###Simulate National GDP values
set.seed(42)
nat_gdp <- c(replicate(20L, {
  foo <- rnorm(3, mean = 2000, sd = 300) + c(0,1000,2000) 
  c(NA,NA,foo)}))
###Simulate national life expectancy values

nat_life <- c(replicate(20L, {
  foo <-  rnorm(2, mean = 55, sd = 7.8) + c(0,1.5)
  c(NA,NA,NA,foo)}))




###Construct the data.table       
data.sim <- data.table(  GovernorateID = c(rep(seq.int(11L,15L,by=1L), each = 20)), 
                         DistrictID =rep(seq.int(1100,1500,by=100),each=20 ) + rep(seq_len(4), each = 5), 
                         Year = seq.int(1990,1994,by=1L),
                         National_gdp =  nat_gdp   , 
                         National_life_exp =    nat_life  )

回答1:

I assume that you want to do the linear model on each DistrictID separately.

Original data table:

head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990           NA                NA
## 2:            11       1101 1991           NA                NA
## 3:            11       1101 1992     1988.746                NA
## 4:            11       1101 1993     2527.619          54.70739
## 5:            11       1101 1994     3854.210          44.21809
## 6:            11       1102 1990           NA                NA

dd <- copy(data.sim) # Make a copy for later.

Replace NA elements in each with the prediction of a linear model. Two steps in one chained operation.

data.sim[, National_life_exp := ifelse(is.na(National_life_exp), 
                                       predict(lm(National_life_exp ~ Year, data=.SD), .SD),
                                       National_life_exp)
         , by=DistrictID
         ][, National_gdp := ifelse(is.na(National_gdp),
                                    predict(lm(National_gdp ~ Year, data=.SD), .SD),
                                    National_gdp) 
           , by=DistrictID
        ]


head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990    -8.004377          86.17531
## 2:            11       1101 1991   924.727559          75.68601
## 3:            11       1101 1992  1988.745871          65.19670
## 4:            11       1101 1993  2527.618676          54.70739
## 5:            11       1101 1994  3854.209743          44.21809
## 6:            11       1102 1990  1008.886661          70.45643

A nice (?) plot. Note that each level of DistrictID has exactly two non-NA points in this example.

plot(data.sim$National_life_exp)
points(dd$National_life_exp, col='red') # The copy from before.

enter image description here

来源：https://stackoverflow.com/questions/15605772/how-to-efficiently-extrapolate-missing-data-for-multiple-variables

标签

time-series

linear-regression

missing-data