How to fuzzy join based on multiple columns and conditions?

问题

I'm trying to left join two data frames (df1, df2). The data frames have two columns in common: zone and slope. Zone is a factor column and slope is numeric.

    df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)))
    df2 = data.frame(slope = c(2.4, 2.4,6.2), zone = c(rep("Low", 1), rep("High", 2)), other = c(rep("a", 1), rep("b", 1), rep("c", 1)))
    df1
    df2

I want to join the data frames such that they are first matched exactly on zone, and then the closest match for slope. If there are two slope values that equidistant, it doesn't matter if the join rounds up or down as long as the rule is applied consistently and it does not result in duplicate rows.

I'd prefer to do this with a fuzzy_join or dplyr rather than data.table.

The result should look something like:

    df3 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)), other = c(rep("a", 3), rep("b",1), rep("c",2)))
    df3

where the value of "other" is first determined by zone, and then the closest slope.

I've tried:

    distance_left_join(df, df2, by=c("zone"= "zone", "slope"="slope"))

as well as other types of fuzzy joins, but I think they may not be working because the columns are of different types. I suspect there is a fuzzy_left_join solution, but I don't understand how to create a match function.

回答1:

Here is how to do fuzzy joins with multiple match_funs. If you want to mix complex match_funs, you'll have to define them with a function yourself as I did here: Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3)))
df2 = data.frame(slope = c(2.4, 2.4,6.2), zone = c(rep("Low", 1), rep("High", 2)), other = c(rep("a", 1), rep("b", 1), rep("c", 1)))

library(fuzzyjoin); library(dplyr)

# First, need to define match_fun_distance. 
# This is copied from the source code for distance_join in https://github.com/dgrtwo/fuzzyjoin
match_fun_distance <- function(v1, v2) {
  
  # settings for this method
  method = "manhattan"
  max_dist = 99
  distance_col = "dist"
  
  if (is.null(dim(v1))) {
    v1 <- t(t(v1))
    v2 <- t(t(v2))
  }
  if (method == "euclidean") {
    d <- sqrt(rowSums((v1 - v2)^2))
  }
  else if (method == "manhattan") {
    d <- rowSums(abs(v1 - v2))
  }
  ret <- tibble::tibble(instance = d <= max_dist)
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- d
  }
  ret
}

(joined_result <- fuzzy_join(df1, df2, 
                             by=c("zone"= "zone", "slope"="slope"), 
                             match_fun = list(`==`, match_fun_distance),
                             mode = "left"))
#>   slope.x zone.x slope.y zone.y other slope.dist zone.dist
#> 1       1    Low     2.4    Low     a        1.4        NA
#> 2       2    Low     2.4    Low     a        0.4        NA
#> 3       3    Low     2.4    Low     a        0.6        NA
#> 4       4   High     2.4   High     b        1.6        NA
#> 5       4   High     6.2   High     c        2.2        NA
#> 6       5   High     2.4   High     b        2.6        NA
#> 7       5   High     6.2   High     c        1.2        NA
#> 8       6   High     2.4   High     b        3.6        NA
#> 9       6   High     6.2   High     c        0.2        NA

joined_result %>%
  group_by(slope.x, zone.x) %>%
  top_n(1, -slope.dist)
#> # A tibble: 6 x 7
#> # Groups:   slope.x, zone.x [6]
#>   slope.x zone.x slope.y zone.y other slope.dist zone.dist
#>     <int> <fct>    <dbl> <fct>  <fct>      <dbl>     <dbl>
#> 1       1 Low        2.4 Low    a          1.4          NA
#> 2       2 Low        2.4 Low    a          0.400        NA
#> 3       3 Low        2.4 Low    a          0.6          NA
#> 4       4 High       2.4 High   b          1.6          NA
#> 5       5 High       6.2 High   c          1.2          NA
#> 6       6 High       6.2 High   c          0.2          NA

^{Created on 2020-10-20 by the reprex package (v0.3.0)}

来源：https://stackoverflow.com/questions/56009863/how-to-fuzzy-join-based-on-multiple-columns-and-conditions

标签

fuzzyjoin