How to join and overwrite data appears to be a common request, but I have yet to find an elegant solution that applies to an entire dataset.
(Note: to simplify the d
I think it's easiest to go to long form:
md1 = melt(d1, id="id")
md2 = melt(d2, id="id")
Then you can stack them and take the latest value:
res1 = unique(rbind(md1, md2), by=c("id", "variable"), fromLast=TRUE)
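To make the stack-and-deduplicate step concrete, here is a minimal sketch on two made-up tables. The contents of d1 and d2 below are hypothetical (the question's real data is not reproduced here), and the explicit value.var in dcast is my addition to avoid the guessing message.

```r
library(data.table)

# Hypothetical toy data, not the tables from the question
d1 <- data.table(id = 1:3, x = c(10, 20, 30), y = c(1, 2, 3))
d2 <- data.table(id = 2:3, x = c(99, 98), y = c(7, 8))

# One row per (id, variable) pair
md1 <- melt(d1, id.vars = "id")
md2 <- melt(d2, id.vars = "id")

# md2 is stacked last, so fromLast = TRUE keeps the d2 value
# whenever both tables have the same (id, variable) pair
res1 <- unique(rbind(md1, md2), by = c("id", "variable"), fromLast = TRUE)

# Back to wide format to inspect the result
wide1 <- dcast(res1, id ~ variable, value.var = "value")
print(wide1)
```

Note that melt puts every value into a single column, so this works most cleanly when the measured columns share a type.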
I'd also like to know how this can be done if you only want to update the NA values in d3, that is, make sure existing non-NA values are not overwritten.
You can exclude rows from the update table, md2, if they appear in md3:
# na.rm=TRUE drops the NA cells of d3, so only its non-NA values block the update
md3 = melt(d3, id="id", na.rm=TRUE)
res3 = unique(rbind(md3, md2[!md3, on=.(id, variable)]),
              by=c("id", "variable"), fromLast=TRUE)
dcast can be used to go back to wide format if necessary, e.g., dcast(res3, id ~ ...).
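A small sketch of the NA-only update, again on hypothetical data. It assumes md3 is built with na.rm = TRUE, so that a cell that is NA in d3 has no row in md3 and therefore does not block the matching row of md2 in the anti-join.

```r
library(data.table)

# Hypothetical toy data: d3 has gaps (NA) that d2 may fill
d3 <- data.table(id = 1:3, x = c(10, NA, 30), y = c(NA, 2, 3))
d2 <- data.table(id = 2:3, x = c(99, 98), y = c(7, 8))

md2 <- melt(d2, id.vars = "id")
# Keep only the non-NA cells of d3
md3 <- melt(d3, id.vars = "id", na.rm = TRUE)

# Anti-join: rows of md2 whose (id, variable) pair has no non-NA value in d3
fill <- md2[!md3, on = .(id, variable)]

res3 <- unique(rbind(md3, fill), by = c("id", "variable"), fromLast = TRUE)
wide3 <- dcast(res3, id ~ variable, value.var = "value")
print(wide3)
```

Here only x for id 2 is taken from d2. The y for id 1 stays NA because d2 has no row for that id, and all existing non-NA values in d3 are left alone.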
Here's @Frank's solution from the comments. (Note: d1 and d2 must already be data.tables, e.g. converted with setDT().)
library(data.table)
cols = setdiff(intersect(names(d1), names(d2)), "id")
d1[d2, on=.(id), (cols) := mget(paste0("i.", cols))]
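For illustration, here is that update join run on hypothetical data. The tables below are made up: d2 is deliberately in a different id order, and d1 has an extra column z that d2 lacks, to show that matching is done by id and that unshared columns are untouched.

```r
library(data.table)

# Hypothetical toy data
d1 <- data.table(id = 1:3, x = c(10, 20, 30), y = c(1, 2, 3),
                 z = c("a", "b", "c"))
d2 <- data.table(id = c(3L, 2L), x = c(98, 99), y = c(8, 7))

# Columns present in both tables, excluding the join key
cols <- setdiff(intersect(names(d1), names(d2)), "id")

# Update join: for each row of d2, find the matching id in d1 and
# overwrite the shared columns by reference with d2's values (i.x, i.y)
d1[d2, on = .(id), (cols) := mget(paste0("i.", cols))]
print(d1)
```

Because := modifies d1 in place, take a copy() first if the original table is still needed.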
As he notes, the original solution I provided below is generally a bad idea. It assigns the rows of d2 to the matching rows of d1 by position rather than by id, so if ids appear multiple times or in a different order, it will do the wrong thing.
d1[d1$id %in% d2$id, names(d2):=d2]
library("dplyr")
d12 <- anti_join(d1, d2, by = "id") %>%
bind_rows(d2)
This solution takes the rows from d1 that aren't in d2, then adds the d2 rows on to them.
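A quick sketch of this on hypothetical data frames. The arrange() call is my addition, only there to restore id order, since the d2 rows are appended at the end.

```r
library(dplyr)

# Hypothetical toy data
d1 <- data.frame(id = 1:3, x = c(10, 20, 30), y = c(1, 2, 3))
d2 <- data.frame(id = 2:3, x = c(99, 98), y = c(7, 8))

d12 <- anti_join(d1, d2, by = "id") %>%  # rows of d1 whose id is not in d2
  bind_rows(d2) %>%                      # append all rows of d2
  arrange(id)                            # optional: restore id order
print(d12)
```

One thing to watch: this replaces whole rows, so if d1 has columns that d2 lacks, bind_rows fills them with NA for the rows that came from d2.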
This won't work for the 'Additional scenario', which looks much messier to resolve and should perhaps be a separate question.