Match two datasets across multiple ‘dirty’ columns in R

问题

I frequently need to match two datasets by multiple matching columns, for two reasons. First, each of these characteristics are ‘dirty’, meaning a single column does not consistently match even when it should (for a truly matching row). Second, the characteristics are not unique (e.g., male and female). Matching like this is useful for matching across time (pre-test with post-test scores), different data modalities (observed characteristics and lab values), or multiple datasets for research participants.

I need a heuristic that selects the best match. Then I can perform analyses of the two together, as described in this question. Note there are many matching columns, and many IDs, so they must both be specified as a list or vector. As an example, I have created two datasets below to match. In the example, DT1 row 1 (ID 1) is the best match for DT2 row 1 (ID 55), even though only the ‘match4’ column matches; this is because DT2 rows 2 and 3 are better matches for DT1 rows 2 and 3. Bonus: DT1 row 7 equally matches DT2 rows 7 and 8, but has a partial match to DT2 row 7, so ideally that would be selected.

Question: For DT1, select a “best guess” for the matching row from DT2, and use each row from DT2 only once. What is the best way to do this (in an efficient and “best practices” idiomatic way) in R?

My preliminary approach: I created a third data.table with a column of IDs from DT1, called DTmatch. All subsequent columns will be IDs from DT2. For the second column of DTmatch (named after the first ID of DT2), each value should represent the count of matching columns (in this example, 0 to 4). Next, find the highest match values in the matching table unique to each row and column. Lastly, create a final column that specifies the DT2 ID that matches the DT1 ID (column 1 in DTmatch).

library(data.table)
# In this example, the datasets are matched by row number, but the real data is not.
DT1 = data.table(
  ID = 1:7,
  match1 = c("b","b","b","a","a","c",NA),
  match2 = c(7, 8, 9, NA, NA, NA, NA),
  match3 = c(0, 0, 0, "j", 13:15),
  match4 = c(rep("m", 4), rep("f", 3)),
  value1 = 45:51,
  value2 = 100:106
)

DT2 = data.table(
  ID = 55:62,
  match1 = c("b","b",4,"a","a","c","j","j"),
  match2 = c(77, 8:14),
  match3 = c(9:14, 155, 16),
  match4 = c(rep("m", 4), NA, rep("f", 3)),
  value1 = 145:152,
  value2 = 101:108
)

# Fix numeric IDs
DT1[, ID := make.names(ID)]
DT2[, ID := make.names(ID)]

# Make new matching table
DTmatch <- DT1[, .(make.names(ID))]
setnames(DTmatch, old = "V1", new = "DT1ID")

# Start with one ID and one matching column
DT2ID <- DT2$ID[1]
DTmatch[, (DT2ID) := 0]
matchingCols <- c("match1")

# Code for first ID and match1, to be adapted for all IDs and all columns
DTmatch[, (DT2ID) := eval(parse(text=DT2ID)) + as.numeric(DT1[, (matchingCols), with=F] == DT2[ID==DT2ID, matchingCols, with=F][[1]])]

# First attempt at matching doesn't work due to NAs
for (thisID in DT2$ID) {
  DTmatch[, (thisID) := 0]
  for (matchingCol in matchingCols) {
#    if (!is.na(DT1[, matchingCol, with=F]) & !is.na(DT2[ID==thisID, matchingCol, with=F])) {
      DTmatch[, (thisID) := eval(parse(text=thisID)) + as.numeric(DT1[, (matchingCol), with=F] == DT2[ID==thisID, matchingCol, with=F][[1]])]
#    }
  }
}

回答1:

Perhaps this is an option to start with:

first, create a new column, by pasting all values from the match-columns together

#create new column based on matching cols
DT1[, col_join := do.call( paste, c(.SD, sep="")), .SDcols= match1:match4][]
DT2[, col_join := do.call( paste, c(.SD, sep="")), .SDcols= match1:match4][]

Then, using the fuzzyjoin-package, you can perform a join based on string-distance. Below, the maximum distance is set to 2. So if no matching string is found within a distance of 2, the result of the join will be <NA>.
You can/should experiment with the different stringdist-methods and the maximum distance...

library(fuzzyjoin)
result <- stringdist_join( DT2, DT1, 
                           by = "col_join", 
                           max_dist = 2, 
                           mode = "left", 
                           distance_col = "string_distance" )

result[,c(1,8,9,16,17)][]
# ID.x col_join.x ID.y col_join.y string_distance
# 1:   55      b779m    1       b70m               2
# 2:   56      b810m    1       b70m               2
# 3:   56      b810m    2       b80m               1
# 4:   56      b810m    3       b90m               2
# 5:   57      4911m   NA       <NA>              NA
# 6:   58     a1012m   NA       <NA>              NA
# 7:   59    a1113NA   NA       <NA>              NA
# 8:   60     c1214f    6     cNA14f               2
# 9:   61    j13155f   NA       <NA>              NA
# 10:   62     j1416f   NA       <NA>              NA

As you can see, you will still have to figure out some stuff, like "what to do with NA-values".
With Fuzzy joining there is always (in my opinion) a lot of trail-and-error involved. And a lot of times you will have to accept that 'the perfect answer' is just not out there...

来源：https://stackoverflow.com/questions/54740113/match-two-datasets-across-multiple-dirty-columns-in-r

标签

merge

data.table

fuzzy-comparison

record-linkage