Using R to insert a value for missing data with a value from another data frame

Question

All,

I have a question that I fear might be too pedestrian to ask here, but searching for it elsewhere is leading me astray. I may not be using the right search terms.

I have a panel data frame (country-year) in R with some missing values on a given variable. I'm trying to impute them with the value from another vector in another data frame. Here's an illustration of what I am trying to do.

Assume Data is the data frame of interest, which has missing values on a given vector that I'm trying to impute from another donor data frame. It looks like this.

country    year      x
  70       1920    9.234
  70       1921    9.234
  70       1922    9.234
  70       1923    9.234
  70       1924    9.234
  80       1920      NA
  80       1921      NA
  80       1922      NA
  80       1923      NA
  80       1924      NA
  90       1920    7.562
  90       1921    7.562
  90       1922    7.562
  90       1923    7.562
  90       1924    7.562

This would be the Donor frame, which has a value for country == 80

country      x
  70       9.234
  80       1.523
  90       7.562

I'm trying to find a seamless way to automate this, beyond a command of Data$x[Data$country == 80] <- 1.523. There are a lot of countries with missingness on x.

It may be worth clarifying that a simple merge would be the easiest, but not necessarily appropriate for what I'm trying to do. Some countries will see variation on x over different years. Basically, what I'm trying to accomplish is a command that says that if the value of x is missing from Data for all years for a given country, take the corresponding value for the country from the Donor data and paste it over all country years as a "best guess" of sorts.

Thanks for any input. I suspect this is a rookie question, but I didn't know the right terms to search for it.

Reproducible code for the above data follows.

country <- c(70,70,70,70,70,80,80,80,80,80,90,90,90,90,90)
year <- c(1920,1921,1922,1923,1924,1920,1921,1922,1923,1924,1920,1921,1922,1923,1924)
x <- c(9.234,9.234,9.234,9.234,9.234,NA,NA,NA,NA,NA,7.562,7.562,7.562,7.562,7.562)

Data=data.frame(country=country,year=year,x=x)
summary(Data)

country <- c(70,80,90)
x <- c(9.234,1.523,7.562)
Donor=data.frame(country=country,x=x)
summary(Donor)

Answer 1:

Using merge:

# merge() sorts its result by "country", so this assumes Data is already ordered by country
r = merge(Data, Donor, by="country", suffixes=c(".Data", ".Donor"))
Data$x = ifelse(is.na(r$x.Data), r$x.Donor, r$x.Data)

If overwriting every value of x seems undesirable, use which to overwrite only the NAs (with the same merge):

r = merge(Data, Donor, by="country", suffixes=c(".Data", ".Donor"))
na.idx = which(is.na(Data$x))
Data[na.idx,"x"] = r[na.idx,"x.Donor"]
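Both snippets above rely on the rows of r lining up with the rows of Data. As a minimal sketch that avoids that dependency, assuming each country appears at most once in Donor, match() can look up the donor value row by row. The helper names donor.pos and na.rows are just illustrative.

```r
# Rebuild the example data from the question
Data <- data.frame(
  country = rep(c(70, 80, 90), each = 5),
  year    = rep(1920:1924, times = 3),
  x       = c(rep(9.234, 5), rep(NA, 5), rep(7.562, 5))
)
Donor <- data.frame(country = c(70, 80, 90), x = c(9.234, 1.523, 7.562))

# For every row of Data, find the position of its country in Donor
# (NA if the country is absent from Donor)
donor.pos <- match(Data$country, Donor$country)

# Fill only the rows where x is missing; all other rows are left untouched
na.rows <- is.na(Data$x)
Data$x[na.rows] <- Donor$x[donor.pos[na.rows]]
```

Because nothing is merged or sorted, the row order of Data does not matter here, and a country missing from Donor simply keeps its NA.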

Answer 2:

Here's one option that should work generally, provided each country with a missing x appears exactly once in Donor:

#Get the vector of countries with missing x
country.na <- Data$country[is.na(Data$x)]
#Get corresponding location of x in Donor
index <- sapply(country.na, function(x) which(Donor$country == x))
#Replace NA values with corresponding values in Donor
Data$x[is.na(Data$x)] <- Donor$x[index]
Data
#    country year     x
# 1       70 1920 9.234
# 2       70 1921 9.234
# 3       70 1922 9.234
# 4       70 1923 9.234
# 5       70 1924 9.234
# 6       80 1920 1.523
# 7       80 1921 1.523
# 8       80 1922 1.523
# 9       80 1923 1.523
# 10      80 1924 1.523
# 11      90 1920 7.562
# 12      90 1921 7.562
# 13      90 1922 7.562
# 14      90 1923 7.562
# 15      90 1924 7.562
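The question asks to fill in a country only when x is missing for all of its years. The answers above fill every NA, so here is a rough sketch of that stricter rule, not taken from either answer. It assumes each country appears at most once in Donor, and the example data is altered so that country 70 has a single gap that should stay NA. The name all.na is illustrative.

```r
# Example data: country 70 has one missing year, country 80 is missing throughout
Data <- data.frame(
  country = rep(c(70, 80, 90), each = 5),
  year    = rep(1920:1924, times = 3),
  x       = c(9.234, NA, 9.234, 9.234, 9.234,
              rep(NA, 5),
              rep(7.562, 5))
)
Donor <- data.frame(country = c(70, 80, 90), x = c(9.234, 1.523, 7.562))

# TRUE on every row of a country whose x is NA in all of its years
all.na <- ave(is.na(Data$x), Data$country, FUN = all)

# Paste the donor value over those country-years only
Data$x[all.na] <- Donor$x[match(Data$country[all.na], Donor$country)]
```

With this version, country 80 receives the donor value for every year, while the single NA for country 70 is left alone because that country has observed values in other years.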

Source: https://stackoverflow.com/questions/17129667/using-r-to-insert-a-value-for-missing-data-with-a-value-from-another-data-frame

Tags

missing-data

data-manipulation