R String match for address using stringdist, stringdistmatrix

盖世英雄少女心 2021-01-07 15:36

I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to match if any of the address in the smaller

1 Answer
  • 2021-01-07 16:37

    Here is a solution that does not require data.table; if the datasets are huge, it can be run in parallel with the parallel package:

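    library(jsonlite)  # rbind.pages() (called rbind_pages() in newer jsonlite releases)
    library(dplyr)     # %>% and select()

    # For each address in Address1, keep the Address2 entry with the smallest
    # edit distance. mclapply() forks, so mc.cores > 1 is not supported on Windows.
    rbind.pages(
      parallel::mclapply(Address1, function(i) {
        data.frame(
          src = i,
          match = Address2[which.min(adist(i, Address2))]
        )
      }, mc.cores = max(1, parallel::detectCores() - 2))) %>%
      select(`src (Address1)` = 1, `match (Address2)` = 2)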
    

    This gives the following output:

                              src (Address1)                     match (Address2)
    1                    786, GALI NO 5, XYZ                   786, GALI NO 4 XYZ
    2       rambo, 45, strret 4, atlast, pqr del, 546, strret2, towards east, pqr
    3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR                  23/4, STREET 2, PQR
    4                    45-B, GALI NO5, XYZ                  45B, GALI NO 5, XYZ
    5                 HECTIC, 99 STREET, PQR                  23/4, STREET 2, PQR
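
For a small input, the same nearest-match lookup can be sketched serially in base R by computing the full distance matrix once. The `Address1` and `Address2` vectors below are reconstructed from the output shown above, which is an assumption: the original vectors are not included in the answer and may have held more entries.

```r
# Sample addresses reconstructed from the printed output (assumed, not the
# original data).
Address1 <- c("786, GALI NO 5, XYZ",
              "rambo, 45, strret 4, atlast, pqr",
              "23/4, 23RD FLOOR, STREET 2, ABC-E, PQR",
              "45-B, GALI NO5, XYZ",
              "HECTIC, 99 STREET, PQR")
Address2 <- c("786, GALI NO 4 XYZ",
              "del, 546, strret2, towards east, pqr",
              "23/4, STREET 2, PQR",
              "45B, GALI NO 5, XYZ")

# adist() on two vectors returns a length(Address1) x length(Address2)
# matrix of generalized Levenshtein distances.
d <- adist(Address1, Address2)

# Column index of the closest Address2 entry for each row (first one on ties).
best <- apply(d, 1, which.min)

result <- data.frame(
  src   = Address1,
  match = Address2[best],
  dist  = d[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
print(result)
```

The full matrix needs memory proportional to the product of the two lengths, so for 500K by 70K records it would have to be built in chunks, or one source address at a time as in the parallel version.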
    

    Edit:

    I realized that this may not be very helpful without the distance computations, which you may want to tweak for your needs. So I replicated the data into larger random sets and amended the function to report the string-distance computations and the processing time:

    rand_addy_one <- rep(Address1, 1000)[sample(1:1000, 1000)]
    rand_addy_two <- rep(Address2, 3000)[sample(1:3000, 3000)]
    
    
    # For every address in rand_addy_one, compute the insertion, deletion and
    # substitution counts against all of rand_addy_two, then keep the rows
    # whose total is minimal (ties and duplicate targets are all kept).
    system.time({
      test_one <- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
        calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
        calc$totals <- rowSums(calc)
        calc %>% mutate(src = i, target = rand_addy_two) %>%
          filter(totals == min(totals))
      }, mc.cores = max(1, parallel::detectCores() - 2))) %>%
        select(`source Address1` = src, `target Address2(matched)` = target,
               insertions = ins, deletions = del, substitutions = sub,
               total_approx_dist = totals)
    })
    
       user  system elapsed 
     24.940   1.480   3.384 
    
    > nrow(test_one)
    [1] 600000
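
To see what the `counts` attribute used above contains, here is a minimal base-R sketch for one source address against two targets. The two target strings are taken from the sample output; the variable names are only illustrative.

```r
src     <- "45-B, GALI NO5, XYZ"
targets <- c("45B, GALI NO 5, XYZ", "23/4, STREET 2, PQR")

# counts = TRUE attaches a 3-dimensional "counts" array to the result:
# source x target x operation type (ins, del, sub).
d <- adist(src, targets, counts = TRUE)

# With a single source, drop() leaves one row per target and one column
# per operation type.
calc <- as.data.frame(drop(attr(d, "counts")))
calc$totals <- rowSums(calc)
calc$target <- targets
print(calc)
```

With the default unit costs, `totals` equals the distance that `adist()` itself returns, which is why the answer can filter on `totals == min(totals)`.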
    

    Now the reverse, matching the larger set against the smaller one:

    # Same computation with the roles swapped: every address in rand_addy_two
    # is compared against all of rand_addy_one.
    system.time({
      test_two <- rbind.pages(parallel::mclapply(rand_addy_two, function(i) {
        calc <- as.data.frame(drop(attr(adist(i, rand_addy_one, counts = TRUE), "counts")))
        calc$totals <- rowSums(calc)
        calc %>% mutate(src = i, target = rand_addy_one) %>%
          filter(totals == min(totals))
      }, mc.cores = max(1, parallel::detectCores() - 2))) %>%
        select(`source Address2` = src, `target Address1(matched)` = target,
               insertions = ins, deletions = del, substitutions = sub,
               total_approx_dist = totals)
    })
    
       user  system elapsed 
     27.512   1.280   4.077 
    
    > nrow(test_two)
    [1] 720000
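
Both row counts are far larger than the number of source addresses because the replicated targets contain many duplicates and `filter(totals == min(totals))` keeps every row tied for the minimum. If one row per source address is wanted instead, a possible sketch (an assumption about the desired result, not part of the original answer) is to match against the unique targets and keep only the first best match:

```r
# Illustrative data: two source addresses and a target vector with duplicates.
src <- c("45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR")
tgt <- rep(c("45B, GALI NO 5, XYZ", "23/4, STREET 2, PQR"), times = 3)

# Duplicates cannot change which target is closest, so compare against
# the distinct values only.
tgt_u <- unique(tgt)

d    <- adist(src, tgt_u)
best <- apply(d, 1, which.min)  # first minimum per source, ties dropped

one_per_src <- data.frame(
  src    = src,
  target = tgt_u[best],
  dist   = d[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
print(one_per_src)
```

Deduplicating first also cuts the number of distance computations, which matters more than the parallelism when the real data contains many repeated addresses.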
    