Question
I have panel data and numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate for missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA values in the "National GDP" and "National Life Expectancy" variables using a linear trend calculated with the observed data points in each column.
library(data.table)

### Simulate national GDP values
set.seed(42)
nat_gdp <- c(replicate(20L, {
  foo <- rnorm(3, mean = 2000, sd = 300) + c(0, 1000, 2000)
  c(NA, NA, foo)}))
### Simulate national life expectancy values
nat_life <- c(replicate(20L, {
  foo <- rnorm(2, mean = 55, sd = 7.8) + c(0, 1.5)
  c(NA, NA, NA, foo)}))
### Construct the data.table
data.sim <- data.table(GovernorateID = c(rep(seq.int(11L, 15L, by = 1L), each = 20)),
                       DistrictID = rep(seq.int(1100, 1500, by = 100), each = 20) + rep(seq_len(4), each = 5),
                       Year = seq.int(1990, 1994, by = 1L),
                       National_gdp = nat_gdp,
                       National_life_exp = nat_life)
Answer 1:
I assume that you want to fit the linear model on each DistrictID separately.
Original data table:
head(data.sim)
## GovernorateID DistrictID Year National_gdp National_life_exp
## 1: 11 1101 1990 NA NA
## 2: 11 1101 1991 NA NA
## 3: 11 1101 1992 1988.746 NA
## 4: 11 1101 1993 2527.619 54.70739
## 5: 11 1101 1994 3854.210 44.21809
## 6: 11 1102 1990 NA NA
dd <- copy(data.sim) # Make a copy for later.
Replace the NA elements in each column with the prediction of a linear model fitted to that column's observed values. Both columns are handled in one chained operation.
data.sim[, National_life_exp := ifelse(is.na(National_life_exp),
                                       predict(lm(National_life_exp ~ Year, data = .SD), .SD),
                                       National_life_exp),
         by = DistrictID
][, National_gdp := ifelse(is.na(National_gdp),
                           predict(lm(National_gdp ~ Year, data = .SD), .SD),
                           National_gdp),
  by = DistrictID
]
head(data.sim)
## GovernorateID DistrictID Year National_gdp National_life_exp
## 1: 11 1101 1990 -8.004377 86.17531
## 2: 11 1101 1991 924.727559 75.68601
## 3: 11 1101 1992 1988.745871 65.19670
## 4: 11 1101 1993 2527.618676 54.70739
## 5: 11 1101 1994 3854.209743 44.21809
## 6: 11 1102 1990 1008.886661 70.45643
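The chained call above repeats the same expression once per column, which gets unwieldy when many columns need filling. One possible generalization, not part of the original answer, is to wrap the fit in a small helper and apply it to several columns at once with lapply(.SD, ...) and .SDcols. This is a minimal sketch: fill_linear, toy, and cols are hypothetical names introduced for illustration, and the toy table stands in for data.sim.

```r
library(data.table)

# Hypothetical helper (an assumption, not from the original answer):
# fill the NA entries of one column with predictions from a linear trend in `year`.
fill_linear <- function(y, year) {
  # With fewer than two observed points there is no line to fit; leave y alone.
  if (sum(!is.na(y)) < 2L) return(as.numeric(y))
  fit  <- lm(y ~ year)                                   # lm() drops NA rows by default
  pred <- predict(fit, newdata = data.frame(year = year)) # predictions for every year
  ifelse(is.na(y), pred, y)                               # keep observed values as they are
}

# Small toy table so the sketch is self-contained.
toy <- data.table(
  DistrictID = rep(c(1L, 2L), each = 5L),
  Year       = rep(1990:1994, times = 2L),
  gdp        = c(NA, NA, 10, 20, 30,  NA, 5, 7, 9, 11),
  life_exp   = c(NA, NA, NA, 50, 52,  NA, NA, NA, 60, 61)
)

cols <- c("gdp", "life_exp")  # columns to extrapolate

# Apply the helper to every column in `cols`, separately within each district.
toy[, (cols) := lapply(.SD, fill_linear, year = Year),
    by = DistrictID, .SDcols = cols]
```

For the simulated data in the question, the same call should work with cols set to c("National_gdp", "National_life_exp") and data.sim in place of toy. As in the chained version, the trend is extrapolated without bounds, so implausible values (such as the negative GDP in the output above) are possible.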
A nice (?) plot. Note that each level of DistrictID has exactly two non-NA National_life_exp points in this example, so each fitted line passes exactly through both of them.
plot(data.sim$National_life_exp)
points(dd$National_life_exp, col='red') # The copy from before.
Source: https://stackoverflow.com/questions/15605772/how-to-efficiently-extrapolate-missing-data-for-multiple-variables