R fuzzy string match to return specific column based on matched string

我怕爱的太早我们不能终老 提交于 2019-11-26 21:58:09

问题


I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller data set are present in the large one. As you would imagine address can be written in different ways and in different cases / spellings etc. Apart from this address can be duplicated if written only till the building level. So different flats have the same address. I did some research and figured out the package stringdist that can be used.

I did some work and managed to get the closest match based on distance. However I am not able to return the corresponding columns for which the address match.

Below is a sample dummy data along with code that I have created to explain the situation

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

This returns me the closed string match based on distance of 3, however I wanted to also have columns of "Year" and "unique_id" from df2 in df1. This would help me to know with which row of data the string was matched from df2. So finally I want to know for each row in df1 what was the closet match from df2 based on the distance specified and have for the matching rows the specific "Year" and "unique_id" from df2.

I guess there is something to do with merge (left join), but I am not sure how I can merge keeping the duplicates and ensuring that I have same number of rows as in df1 (small data set).

Any kind of solution would help!!


回答1:


You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index for the row in df2 (which is your table) that is the closest match of each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
    df2_pos := match_pos(Address1, df2$Address2,3) ] 
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]



回答2:


Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.

You can use stringdist method="dl" (or others that might work better).

To meet your requirement of "ensuring that I have same number of rows as in df1", I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance. This was suggested by dgrtwo, the developer of fuzzyjoin. (Hopefully it'll be part of the package itself in the future.)

(I also had to make an assumption to take the max year2 in the event of distance ties.)

Code:

library(data.table, quietly = TRUE)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) 
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)]

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
stringdist_join(df1, df2, 
                by = c("Address1" = "Address2"), 
                mode = "left", 
                method = "dl", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(Address1, Year1) %>%
  top_n(1, -dist) %>%
  top_n(1, Year2)

Result:

# A tibble: 7 x 6
# Groups:   Address1, Year1 [7]
                                Address1 Year1                             Address2 Year2 unique_id  dist
                                   <chr> <int>                                <chr> <int>     <chr> <dbl>
1                    786, GALI NO 5, XYZ  2001                   786, GALI NO 4 XYZ  2007    000007     2
2       rambo, 45, strret 4, atlast, pqr  2002 del, 546, strret2, towards east, pqr  2009    000009    17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR  2003                  23/4, STREET 2, PQR  2010    000010    19
4                    45-B, GALI NO5, XYZ  2004                  45B, GALI NO 5, XYZ  2008    000008     2
5                 HECTIC, 99 STREET, PQR  2005                  23/4, STREET 2, PQR  2010    000010    11
6                    786, GALI NO 5, XYZ  2006                   786, GALI NO 4 XYZ  2007    000007     2
7       rambo, 45, strret 4, atlast, pqr  2007 del, 546, strret2, towards east, pqr  2009    000009    17


来源:https://stackoverflow.com/questions/42749447/r-fuzzy-string-match-to-return-specific-column-based-on-matched-string

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!