I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller
I have a solution that does not require data.table
but if the set is huge could run with package:parallel
rbind.pages(
parallel::mclapply(Address1, function(i){
data.frame(
src = i,
match = Address2[which.min(adist(i, Address2))]
)
}, mc.cores = parallel::detectCores() - 2)) %>%
select(`src (Address1)`= 1, `match (Address2)` = 2)
Which then gives the output solution:
src (Address1) match (Address2)
1 786, GALI NO 5, XYZ 786, GALI NO 4 XYZ
2 rambo, 45, strret 4, atlast, pqr del, 546, strret2, towards east, pqr
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR 23/4, STREET 2, PQR
4 45-B, GALI NO5, XYZ 45B, GALI NO 5, XYZ
5 HECTIC, 99 STREET, PQR 23/4, STREET 2, PQR
I realized that this may not be very helpful without seeing the distance computations so that you may tweak for your needs ; so I replicated the data into larger random sets and then amended the function to show the string distance computations and the processing time
rand_addy_one <- rep(Address1, 1000)[sample(1:1000, 1000)]
rand_addy_two <- rep(Address2, 3000)[sample(1:3000, 3000)]
system.time({
test_one <<- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
calc$totals <- (rowSums(calc))
calc %>% mutate(src = i, target = rand_addy_two) %>%
filter(totals == min(totals))
}, mc.cores = parallel::detectCores() - 2)) %>%
select(`source Address1` = src, `target Address2(matched)` = target,
insertions = ins, deletions = del, substitutions = sub,
total_approx_dist = totals)
})
user system elapsed
24.940 1.480 3.384
> nrow(test_one)
[1] 600000
Now to reverse and apply the larger set to the smaller:
system.time({
test_two <<- rbind.pages(parallel::mclapply(rand_addy_two, function(i) {
calc <- as.data.frame(drop(attr(adist(i, rand_addy_one, counts = TRUE), "counts")))
calc$totals <- (rowSums(calc))
calc %>% mutate(src = i, target = rand_addy_one) %>%
filter(totals == min(totals))
}, mc.cores = parallel::detectCores() - 2)) %>%
select(`source Address2` = src, `target Address1(matched)` = target,
insertions = ins, deletions = del, substitutions = sub,
total_approx_dist = totals)
})
user system elapsed
27.512 1.280 4.077
nrow(test_two)
[1] 720000