Join and overwrite data in one table with data from another table

Question

How to join and overwrite data appears to be a common request, but I have yet to find an elegant solution that applies to an entire dataset.

(Note: to simplify the data, I will use only 1s and NAs for values and a small subset of columns, but in reality I have hundreds of columns with different values).

I have one data table (d1) that has NA values in certain columns and rows.

library(data.table)
d1 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1 NA NA
4  D  1  1 NA
5  E  1 NA  1")[, r := NULL]

And I have another data table (d2) that consists of additional columns as well as data points missing from existing columns in d1.

d2 = fread(
"r id v2 v3 v4 v5
1  C  1  1  1  1
2  D  1  1  1  1
3  E  1  1  1  1")[, r := NULL ]

I would like to basically join + overwrite d1 with all the data in d2, making sure of course to match rows by id and columns by name, as shown below.

> d12
  id v1 v2 v3 v4 v5
1  A  1  1  1 NA NA
2  B  1  1  1 NA NA
3  C  1  1  1  1  1
4  D  1  1  1  1  1
5  E  1  1  1  1  1

Additional scenario: I'd also like to know how this can be done if you only want to update the NA values in d1, that is, make sure existing non-NA values are not overwritten. (To make this easier to visualize, I'm including new tables with both 1s and 0s).

For example, if we have d3

d3 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1  0 NA
4  D  1  1  0
5  E  1 NA  1")[, r := NULL ]

And we want to join d2 and overwrite only NAs to get:

> d32
  id v1 v2 v3 v4 v5
1  A  1  1  1 NA NA
2  B  1  1  1 NA NA
3  C  1  0  1  1  1
4  D  1  1  0  1  1
5  E  1  1  1  1  1

FYI, below are some other posts addressing this problem but only for one or two columns. The solution I'm looking for should allow the data in one table to be overwritten by many if not all of the columns in another table.

Merge data frames and overwrite values

Merge two data frame and replace the NA value in R

A data.table-based solution would be preferred, but others are welcome.

Answer 1:

I think it's easiest to go to long form:

md1 = melt(d1, id="id")
md2 = melt(d2, id="id")

Then you can stack them and take the latest value:

res1 = unique(rbind(md1, md2), by=c("id", "variable"), fromLast=TRUE)
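For reference, here is a minimal end-to-end sketch of the long-form overwrite, including the reshape back to wide. The toy tables are rebuilt with data.table() rather than fread() so the block is self-contained, and the result name d12 is taken from the question; treat it as an illustration, not the only way to do it.

```r
library(data.table)

# Toy tables from the question, built directly instead of via fread()
d1 = data.table(id = c("A", "B", "C", "D", "E"),
                v1 = 1L,
                v2 = c(1L, 1L, NA, 1L, NA),
                v3 = c(1L, 1L, NA, NA, 1L))
d2 = data.table(id = c("C", "D", "E"), v2 = 1L, v3 = 1L, v4 = 1L, v5 = 1L)

# One row per (id, variable) cell
md1 = melt(d1, id.vars = "id")
md2 = melt(d2, id.vars = "id")

# Stack d1's cells first and d2's cells last; fromLast = TRUE keeps d2's value
# whenever the same (id, variable) pair occurs in both tables
res1 = unique(rbind(md1, md2), by = c("id", "variable"), fromLast = TRUE)

# Back to wide format; cells that exist in neither table come out as NA
d12 = dcast(res1, id ~ variable, value.var = "value")
print(d12)
```

Because every cell is a row in long form, columns that exist only in d2 (v4, v5) need no special handling.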

"I'd also like to know how this can be done if you only want to update the NA values in [d3], that is, make sure existing non-NA values are not overwritten."

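You can exclude rows from the update table, md2, if they appear in md3. Melt d3 with na.rm=TRUE so that its NA cells are dropped from md3 and therefore do not block the corresponding d2 values:

md3 = melt(d3, id="id", na.rm=TRUE)

res3 = unique(rbind(md3, md2[!md3, on=.(id, variable)]),
  by=c("id", "variable"), fromLast=TRUE)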

dcast can be used to go back to wide format if necessary, e.g., dcast(res3, id ~ ...).
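A self-contained sketch of the NA-only variant, reshaped back to wide, might look like the following. It assumes d3's NA cells are dropped while melting (na.rm = TRUE) so that they cannot shield d2's values in the anti-join. Since the anti-join already removes every overlapping cell, the unique() step is left out here. The result name d32 comes from the question.

```r
library(data.table)

# Toy tables from the question, built directly instead of via fread()
d3 = data.table(id = c("A", "B", "C", "D", "E"),
                v1 = 1L,
                v2 = c(1L, 1L, 0L, 1L, NA),
                v3 = c(1L, 1L, NA, 0L, 1L))
d2 = data.table(id = c("C", "D", "E"), v2 = 1L, v3 = 1L, v4 = 1L, v5 = 1L)

# Long form; na.rm = TRUE removes d3's NA cells so d2 may fill them
md3 = melt(d3, id.vars = "id", na.rm = TRUE)
md2 = melt(d2, id.vars = "id")

# Anti-join: keep only the d2 cells that have no non-NA counterpart in d3
res3 = rbind(md3, md2[!md3, on = .(id, variable)])

# Back to wide format
d32 = dcast(res3, id ~ variable, value.var = "value")
print(d32)
```

The existing 0s in d3 (C/v2 and D/v3) survive, while the NA cells (C/v3 and E/v2) pick up d2's values.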

Answer 2:

Here's @Frank's solution from the comments. (Note: d1 and d2 need to be defined as data.table first).

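library(data.table)
# columns present in both tables, excluding the join key
cols = setdiff(intersect(names(d1), names(d2)), "id")
# update join: overwrite d1 by reference with d2's values ("i." prefix) on matching ids
d1[d2, on=.(id), (cols) := mget(paste0("i.", cols))]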

As he notes, the original solution I provided below is a bad idea generally speaking. If ids appear multiple times or in a different order, it will do the wrong thing.

~~d1[d1$id %in% d2$id, names(d2):=d2]~~
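The update join above only touches the columns that d1 and d2 share. As a hedged variation (not part of the original comment), taking every non-key column of d2 should also add v4 and v5, which is what the d12 table in the question shows. The copy() call is an addition for illustration so that d1 itself is left unchanged.

```r
library(data.table)

# Toy tables from the question, built directly instead of via fread()
d1 = data.table(id = c("A", "B", "C", "D", "E"),
                v1 = 1L,
                v2 = c(1L, 1L, NA, 1L, NA),
                v3 = c(1L, 1L, NA, NA, 1L))
d2 = data.table(id = c("C", "D", "E"), v2 = 1L, v3 = 1L, v4 = 1L, v5 = 1L)

# := modifies by reference, so work on a copy to keep d1 intact
d12 = copy(d1)

# All value columns of d2: shared ones are overwritten, new ones are created
cols = setdiff(names(d2), "id")

# Update join: for ids found in d2, assign d2's values (addressed as i.<name>);
# rows of d12 without a match keep their values and get NA in the new columns
d12[d2, on = .(id), (cols) := mget(paste0("i.", cols))]
print(d12)
```

This variant overwrites unconditionally, so it covers the first scenario only.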

Answer 3:

library("dplyr")

d12 <- anti_join(d1, d2, by = "id") %>%
         bind_rows(d2)

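This solution takes the rows from d1 that aren't in d2, then adds the d2 rows on to them. Note that the rows taken from d2 will have NA in any column that exists only in d1 (here v1), because whole rows are replaced rather than individual values.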

This won't work for the 'Additional scenario', which looks much much messier to resolve, and maybe should be a separate question.

Source: https://stackoverflow.com/questions/46761065/join-and-overwrite-data-in-one-table-with-data-from-another-table

Tags

data.table

overwrite