问题
I have 2 datasets and want to do fuzzy join.
Here is the two datasets.
library(data.table)
# data1
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS", header = T)
# data2
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI", header = T)
Two datasets have the same characters in State
and type
; however, columns NAME
are not the same. They are similar.
Although I can subtract columns NAME
on each data with 3 or 4 charters and then merge them, it seems that the correct ratio may not high due to large observations.
dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = T)
The method is bad.
I check for package fuzzyjoin
. But not sure whether I am correct or not.
library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
# Results
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: <NA> <NA> <NA> ADA OHIO OH CI
9: <NA> <NA> <NA> ADAMS MASS MA CI
10: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
The result in this exercise is correct, see below. But If any NAME in these two data are the same, the answer will not correct.
I create a new observation in these two data.
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS
THE SAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI
THE SAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: THE SAME AA BB <NA> <NA> <NA>
9: <NA> <NA> <NA> ADA OHIO OH CI
10: <NA> <NA> <NA> ADAMS MASS MA CI
11: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
12: <NA> <NA> <NA> THE SAME AA BB
This is incorrect result. Any suggestion?
It seems that I cannot use fuzzy_full_join
.
回答1:
It is because you asked fuzzy_full_join to give you NAMES that did not match (with !=) and then state and types that did match (with == ==). So if in the case all three do match, it won't show up.
You could run it twice with:
match_fun = list(`!=`, `==`, `==`))
match_fun = list(`==`, `==`, `==`))
library(data.table); library(fuzzyjoin)
#> Warning: package 'data.table' was built under R version 3.5.2
dt1 <- fread("NAME State type
ABERCOMBIETOWNSHIP ND TS
ABERDEENTOWNSHIP NJ TS
ABERDEENTOWNSHIP SD TS
ABBOTSFORDCITY WI CI
ABERDEENCITY WA CI
ADATOWNSHIP MI TS
ADAMS IL TS
THESAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEENTWPNJ NJ TS
ABERDEENWASH WA CI
ABBOTSFORDWIS WI CI
ADATWPMICH MI TS
ADAOHIO OH CI
ADAMSMASS MA CI
ADAMSVILLEALA AL CI
THESAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
#> NAME.x State.x type.x NAME.y State.y type.y
#> 1: ABERDEENTOWNSHIP NJ TS ABERDEENTWPNJ NJ TS
#> 2: ABBOTSFORDCITY WI CI ABBOTSFORDWIS WI CI
#> 3: ABERDEENCITY WA CI ABERDEENWASH WA CI
#> 4: ADATOWNSHIP MI TS ADATWPMICH MI TS
#> 5: ABERCOMBIETOWNSHIP ND TS <NA> <NA> <NA>
#> 6: ABERDEENTOWNSHIP SD TS <NA> <NA> <NA>
#> 7: ADAMS IL TS <NA> <NA> <NA>
#> 8: THESAME AA BB <NA> <NA> <NA>
#> 9: <NA> <NA> <NA> ADAOHIO OH CI
#> 10: <NA> <NA> <NA> ADAMSMASS MA CI
#> 11: <NA> <NA> <NA> ADAMSVILLEALA AL CI
#> 12: <NA> <NA> <NA> THESAME AA BB
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`==`, `==`, `==`))
#> NAME.x State.x type.x NAME.y State.y type.y
#> 1: THESAME AA BB THESAME AA BB
#> 2: ABERCOMBIETOWNSHIP ND TS <NA> <NA> <NA>
#> 3: ABERDEENTOWNSHIP NJ TS <NA> <NA> <NA>
#> 4: ABERDEENTOWNSHIP SD TS <NA> <NA> <NA>
#> 5: ABBOTSFORDCITY WI CI <NA> <NA> <NA>
#> 6: ABERDEENCITY WA CI <NA> <NA> <NA>
#> 7: ADATOWNSHIP MI TS <NA> <NA> <NA>
#> 8: ADAMS IL TS <NA> <NA> <NA>
#> 9: <NA> <NA> <NA> ABERDEENTWPNJ NJ TS
#> 10: <NA> <NA> <NA> ABERDEENWASH WA CI
#> 11: <NA> <NA> <NA> ABBOTSFORDWIS WI CI
#> 12: <NA> <NA> <NA> ADATWPMICH MI TS
#> 13: <NA> <NA> <NA> ADAOHIO OH CI
#> 14: <NA> <NA> <NA> ADAMSMASS MA CI
#> 15: <NA> <NA> <NA> ADAMSVILLEALA AL CI
Created on 2019-03-17 by the reprex package (v0.2.1)
来源:https://stackoverflow.com/questions/51958119/fuzzy-outer-join-merge-in-r