R String match for address using stringdist, stringdistmatrix

前端未结

关注

 1  1632

I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller

相关标签:

1条回答

醉酒成梦

2021-01-07 16:37

I have a solution that does not require data.table but if the set is huge could run with package:parallel

 rbind.pages(
  parallel::mclapply(Address1, function(i){
    data.frame(
       src = i, 
       match = Address2[which.min(adist(i, Address2))]
     )
   }, mc.cores = parallel::detectCores() - 2)) %>% 
 select(`src (Address1)`= 1, `match (Address2)` = 2)

Which then gives the output solution:

                          src (Address1)                     match (Address2)
1                    786, GALI NO 5, XYZ                   786, GALI NO 4 XYZ
2       rambo, 45, strret 4, atlast, pqr del, 546, strret2, towards east, pqr
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR                  23/4, STREET 2, PQR
4                    45-B, GALI NO5, XYZ                  45B, GALI NO 5, XYZ
5                 HECTIC, 99 STREET, PQR                  23/4, STREET 2, PQR

Edit:

I realized that this may not be very helpful without seeing the distance computations so that you may tweak for your needs ; so I replicated the data into larger random sets and then amended the function to show the string distance computations and the processing time

rand_addy_one <- rep(Address1, 1000)[sample(1:1000, 1000)]
rand_addy_two <- rep(Address2, 3000)[sample(1:3000, 3000)]


system.time({
  test_one <<- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
    calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
    calc$totals <- (rowSums(calc))
    calc %>% mutate(src = i, target = rand_addy_two) %>% 
      filter(totals == min(totals))
  }, mc.cores = parallel::detectCores() - 2))  %>% 
    select(`source Address1` = src, `target Address2(matched)` = target,
           insertions = ins, deletions = del, substitutions = sub,
           total_approx_dist = totals)
})

   user  system elapsed 
 24.940   1.480   3.384 

> nrow(test_one)
[1] 600000

Now to reverse and apply the larger set to the smaller:

system.time({
   test_two <<- rbind.pages(parallel::mclapply(rand_addy_two, function(i) {
    calc <- as.data.frame(drop(attr(adist(i, rand_addy_one, counts = TRUE), "counts")))
    calc$totals <- (rowSums(calc))
    calc %>% mutate(src = i, target = rand_addy_one) %>% 
        filter(totals == min(totals))
}, mc.cores = parallel::detectCores() - 2))  %>% 
    select(`source Address2` = src, `target Address1(matched)` = target,
           insertions = ins, deletions = del, substitutions = sub,
           total_approx_dist = totals)
})

   user  system elapsed 
 27.512   1.280   4.077 

nrow(test_two)
[1] 720000

0 讨论(0)