Question
All,
I have a question that I fear might be too pedestrian to ask here, but searching for it elsewhere is leading me astray. I may not be using the right search terms.
I have a panel data frame (country-year) in R with some missing values on a given variable. I'm trying to impute them with the value from another vector in another data frame. Here's an illustration of what I am trying to do.
Assume Data is the data frame of interest, which has missing values on a given vector that I'm trying to impute from another donor data frame. It looks like this.
country year x
70 1920 9.234
70 1921 9.234
70 1922 9.234
70 1923 9.234
70 1924 9.234
80 1920 NA
80 1921 NA
80 1922 NA
80 1923 NA
80 1924 NA
90 1920 7.562
90 1921 7.562
90 1922 7.562
90 1923 7.562
90 1924 7.562
This would be the Donor frame, which has a value for country == 80:
country x
70 9.234
80 1.523
90 7.562
I'm trying to find a seamless way to automate this, beyond a command of Data$x[Data$country == 80] <- 1.523. There are a lot of countries with missingness on x.
It may be worth clarifying that a simple merge would be the easiest, but not necessarily appropriate for what I'm trying to do. Some countries will see variation on x over different years. Basically, what I'm trying to accomplish is a command that says that if the value of x is missing from Data for all years for a given country, take the corresponding value for the country from the Donor data and paste it over all country years as a "best guess" of sorts.
Thanks for any input. I suspect this is a rookie question, but I didn't know the right terms to search for it.
Reproducible code for the above data follows.
country <- c(70,70,70,70,70,80,80,80,80,80,90,90,90,90,90)
year <- c(1920,1921,1922,1923,1924,1920,1921,1922,1923,1924,1920,1921,1922,1923,1924)
x <- c(9.234,9.234,9.234,9.234,9.234,NA,NA,NA,NA,NA,7.562,7.562,7.562,7.562,7.562)
Data=data.frame(country=country,year=year,x=x)
summary(Data)
country <- c(70,80,90)
x <- c(9.234,1.523,7.562)
Donor=data.frame(country=country,x=x)
summary(Donor)
Answer 1:
Using merge:
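# Join the donor value onto every country-year row; the two x columns get suffixes.
r = merge(Data, Donor, by="country", suffixes=c(".Data", ".Donor"))
# Positional assignment: this assumes the rows of r line up with the rows of Data.
# merge() sorts by country, so that holds here because Data is already sorted that way.
Data$x = ifelse(is.na(r$x.Data), r$x.Donor, r$x.Data)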
If for some reason the idea of overwriting all values of x seems bad, then use which to overwrite only the NAs (with the same merge):
r = merge(Data, Donor, by="country", suffixes=c(".Data", ".Donor"))
na.idx = which(is.na(Data$x))
Data[na.idx,"x"] = r[na.idx,"x.Donor"]
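A possible variation, sketched here under the assumption that each country appears at most once in Donor: match() looks up each row's position in Donor directly, so the fill does not depend on merge() returning rows in the same order as Data. The row reversal below is only there to demonstrate that and is not part of the original answer.

```r
# Rebuild the example data from the question.
Data <- data.frame(
  country = rep(c(70, 80, 90), each = 5),
  year    = rep(1920:1924, times = 3),
  x       = rep(c(9.234, NA, 7.562), each = 5)
)
Donor <- data.frame(country = c(70, 80, 90), x = c(9.234, 1.523, 7.562))

# Reverse the rows to show that the lookup does not rely on sorting.
Data <- Data[15:1, ]

# For each row of Data, the position of its country in Donor (NA if absent).
pos <- match(Data$country, Donor$country)

# Overwrite only the missing entries with the corresponding donor value.
na.idx <- is.na(Data$x)
Data$x[na.idx] <- Donor$x[pos[na.idx]]
```

A country that is missing from Donor simply stays NA, because match() returns NA for it.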
Answer 2:
Here's one option that should work generally:
#Get the vector of countries with missing x
country.na <- Data$country[is.na(Data$x)]
#Get corresponding location of x in Donor
index <- sapply(country.na, function(x) which(Donor$country == x))
#Replace NA values with corresponding values in Donor
Data$x[is.na(Data$x)] <- Donor$x[index]
Data
# country year x
# 1 70 1920 9.234
# 2 70 1921 9.234
# 3 70 1922 9.234
# 4 70 1923 9.234
# 5 70 1924 9.234
# 6 80 1920 1.523
# 7 80 1921 1.523
# 8 80 1922 1.523
# 9 80 1923 1.523
# 10 80 1924 1.523
# 11 90 1920 7.562
# 12 90 1921 7.562
# 13 90 1922 7.562
# 14 90 1923 7.562
# 15 90 1924 7.562
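The code above fills every NA, whereas the question asks to fill only when x is missing for all years of a country. A minimal sketch of that stricter rule follows, using ave() to flag such countries. The extra NA for country 90 in 1922 is a hypothetical tweak to the example data, added only to show that partially missing countries are left alone.

```r
# Example data from the question.
Data <- data.frame(
  country = rep(c(70, 80, 90), each = 5),
  year    = rep(1920:1924, times = 3),
  x       = rep(c(9.234, NA, 7.562), each = 5)
)
Donor <- data.frame(country = c(70, 80, 90), x = c(9.234, 1.523, 7.562))

# Hypothetical tweak: country 90 is missing in 1922 only.
Data$x[Data$country == 90 & Data$year == 1922] <- NA

# TRUE for every row whose country has x missing in all of its years.
all.na <- as.logical(ave(is.na(Data$x), Data$country, FUN = all))

# Look up each row's country in Donor and fill only the flagged rows.
pos <- match(Data$country, Donor$country)
Data$x[all.na] <- Donor$x[pos[all.na]]
```

With this data, country 80 receives the donor value in every year, while the single gap for country 90 remains NA and can be handled separately.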
Source: https://stackoverflow.com/questions/17129667/using-r-to-insert-a-value-for-missing-data-with-a-value-from-another-data-frame