Joining/matching data frames in R

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-03 22:02:08

Edit after data was provided:

Using the data definition from @MKR's post:

library(fuzzyjoin)
fuzzy_left_join(Table_1, Table_2, match_fun = function(x, y) y > x & y <= 1.1 * x)
# Joining by: "x"
#   x.x  y  x.y  z
# 1   1 25 1.10 30
# 2   2 26 2.05 35
# 3   3 27   NA NA

General explanation with fake data (first answer)

fake data

iris1 <- head(iris[1:2])
iris1
#   Sepal.Length Sepal.Width
# 1          5.1         3.5
# 2          4.9         3.0
# 3          4.7         3.2
# 4          4.6         3.1
# 5          5.0         3.6
# 6          5.4         3.9

iris2 <- head(iris[c(1,3)])
set.seed(1)

# add noise
iris2$Sepal.Length <- iris2$Sepal.Length + rnorm(6,sd=0.05)

# shuffle rows
iris2 <- iris2[sample(seq(nrow(iris2))),]

iris2
#   Sepal.Length Petal.Length
# 5     5.016475          1.4
# 2     4.909182          1.4
# 4     4.679764          1.5
# 6     5.358977          1.7
# 3     4.658219          1.3
# 1     5.068677          1.4

code

library(fuzzyjoin)
fuzzy_left_join(iris1, iris2, match_fun = function(x, y) y > 0.99 * x & y < 1.01 * x)
# Joining by: "Sepal.Length"
#   Sepal.Length.x Sepal.Width Sepal.Length.y Petal.Length
# 1            5.1         3.5       5.068677          1.4
# 2            4.9         3.0       4.909182          1.4
# 3            4.7         3.2       4.679764          1.5
# 4            4.7         3.2       4.658219          1.3
# 5            4.6         3.1             NA           NA
# 6            5.0         3.6       5.016475          1.4
# 7            5.4         3.9       5.358977          1.7

Most rows matched cleanly, so let's look at the exceptions. Row 4 of iris2 had so much noise added (4.6 became 4.679764) that it falls outside the 1% window around 4.6 (4.554 to 4.646) and inside the window around 4.7, so it was paired with row 3 of iris1, which therefore has 2 matches. Because this is a left join, row 4 of iris1 is still shown, but with NAs in iris2's columns.

As I understand it:

  • The joining columns are expanded so that every value from one table is paired with every value from the other
  • The function takes these long columns (6*6==36 elements here) as arguments
  • We can apply vectorized functions (such as < or & in this case) to return a logical vector that filters these long columns in order to build the output data.frame.
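
The three points above can be sketched in base R. This is only an illustration of the idea, not fuzzyjoin's actual implementation: the toy vectors x_col and y_col and the helper names are made up for the example, and only the predicate is copied from the call above.

```r
# Hypothetical toy vectors standing in for the two join columns.
x_col <- c(5.1, 4.9, 4.7)           # join column of the left table
y_col <- c(4.68, 5.07, 4.91, 6.00)  # join column of the right table

# Expand to every (left row, right row) pair: 3 * 4 = 12 pairs.
pairs  <- expand.grid(i = seq_along(x_col), j = seq_along(y_col))
x_long <- x_col[pairs$i]
y_long <- y_col[pairs$j]

# Same predicate as the match_fun above; it is vectorized,
# so it returns one logical value per pair.
match_fun <- function(x, y) y > 0.99 * x & y < 1.01 * x
keep <- match_fun(x_long, y_long)

# Row-index pairs that survive the filter.
matched <- pairs[keep, ]
matched
```

Each surviving (i, j) pair says which left row goes with which right row. A left join would then add back the left rows that appear in no pair, filled with NA.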

distance_left_join is simpler to use, but it matches on absolute distance, not relative distance.
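
To see why the distinction matters, here is a small base R comparison of an absolute cutoff with a relative one. The numbers and the two tolerances (0.2 and 10%) are made up for illustration, and no fuzzyjoin function is called.

```r
# Hypothetical values: each y is 5% above its x, at three different scales.
x <- c(1, 10, 100)
y <- c(1.05, 10.5, 105)

# Absolute tolerance: a fixed cutoff on |y - x|.
abs_match <- abs(y - x) <= 0.2

# Relative tolerance: the cutoff scales with x (here 10% of x).
rel_match <- abs(y - x) / x <= 0.1

data.frame(x, y, abs_match, rel_match)
```

With a fixed cutoff only the smallest pair matches, while the relative test accepts all three. That is why a custom match_fun is needed when the tolerance should grow with the value.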

Another option is to use sqldf:

library(sqldf)


sqldf("select T1.x, T1.y, A.z from Table_1 T1
left join (select Table_1.x, Table_1.y, Table_2.z from Table_1 
   left join Table_2 where round((100*abs(Table_1.x - Table_2.x)/Table_1.x),0) <= 10) A 
on T1.x = A.x")

#   x  y  z
# 1 1 25 30
# 2 2 26 35
# 3 3 27 NA
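
For comparison, the same cross-join-then-filter logic can be sketched in base R without sqldf. This is a rough equivalent rather than a translation checked against SQLite's semantics; the intermediate names t2, pairs and close are made up, and the x column of Table_2 is renamed to x2 only to avoid a name clash.

```r
Table_1 <- data.frame(x = c(1, 2, 3), y = c(25, 26, 27))
Table_2 <- data.frame(x = c(1.1, 2.05, 3.8), z = c(30, 35, 34))

# Rename Table_2's key so both keys survive the cross join.
t2 <- setNames(Table_2, c("x2", "z"))

# by = NULL makes merge() return the Cartesian product (3 * 3 = 9 rows).
pairs <- merge(Table_1, t2, by = NULL)

# Keep pairs whose rounded relative difference is at most 10 percent,
# mirroring the WHERE clause of the query above.
close <- pairs[round(100 * abs(pairs$x - pairs$x2) / pairs$x) <= 10, c("x", "z")]

# Left join back onto Table_1 so unmatched rows keep an NA in z.
result <- merge(Table_1, close, by = "x", all.x = TRUE)
result
```

On this data the result should have the same three rows as the sqldf output, with NA in z for x = 3.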

Data:

Table_1 <- read.table(text = 
"x  y  
1   25  
2   26  
3   27",
header = TRUE)


Table_2 <- read.table(text = 
"x  z  
1.1    30  
2.05   35  
3.8    34",
header = TRUE)