In R - fastest way pairwise comparing character strings on similarity

Submitted by 青春壹個敷衍的年華 on 2019-12-22 01:34:40

Question


I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?

Say I have the following data.frame:

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 v1 = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                 v2 = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
                 stringsAsFactors = FALSE)

I want to compare each pair of rows in df on their Jaro-Winkler distance (as returned by stringdist with method = "jw", where 0 means the two strings are identical).

With some help of others (see this post), I've been able to construct this code:

library(stringdist)

# columns to compare
testCols <- c("names", "v1", "v2")

# compare all pairs of rows, one pair per loop iteration
RowCompare <- function(x) {
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))
  for (i in 1:nrow(pairs)) {
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}

# define TestsCompare: distances between two single rows, one per column
TestsCompare <- function(x, y) {
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}

This generates the correct output:

output = as.data.frame(RowCompare(df))

> output
   row_a row_b   names_j      v1_j      v2_j
1      1     2 0.4444444 0.1111111 0.0000000
2      1     3 0.3571429 0.0000000 0.1111111
3      1     4 0.4444444 0.1111111 0.1111111
4      1     5 0.4444444 0.1111111 0.1111111  
5      2     3 0.4603175 0.1111111 0.1111111
6      2     4 0.3333333 0.0000000 0.1111111
7      2     5 0.3333333 0.0000000 0.1111111
8      3     4 0.5634921 0.1111111 0.0000000
9      3     5 0.5634921 0.1111111 0.0000000
10     4     5 0.0000000 0.0000000 0.0000000

However, my real data.frame has 8 million observations and I make 17 comparisons. Running this code takes days.

I am looking for ways to speed up this process:

  • Should I use matrices instead of data.frames?
  • How to parallelize this process?
  • Vectorize?

Answer 1:


If you iterate over the variables you want to check, you can make a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one for each column), which you can in turn iterate over to call broom::tidy, which will turn them into nicely formatted data.frames.

If you use purrr::map_df and its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can keep them straight. The resulting data.frame will be in long form, so if you want it to match the results above, reshape with tidyr::spread.
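To make the single-column step concrete, here is a minimal sketch for the names column only. It assumes the stringdist package is installed, and it uses base R (combn) instead of broom::tidy to label the pairs, which is a substitution for illustration and not the approach described above. The variable names names_vec, d, pairs and long are made up for this example.

```r
library(stringdist)

names_vec <- c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND")

# One call computes every pairwise distance for this column.
# Called with a single vector, stringdistmatrix() returns a "dist" object
# that holds only the lower triangle (no diagonal, no duplicate pairs).
d <- stringdistmatrix(names_vec, method = "jw")

# A dist object stores the lower triangle column by column, which is the
# same order in which combn() enumerates the pairs (1,2), (1,3), (1,4), ...
pairs <- t(combn(length(names_vec), 2))
long <- data.frame(row_a = pairs[, 1],
                   row_b = pairs[, 2],
                   names_j = as.vector(d))
long
```

The names_j column should agree with the names_j column in the question's output, since both use method = "jw" on the same strings.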

If, as you mentioned in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.

Altogether,

library(tidyverse)

map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>% 
    map_df(broom::tidy, .id = 'var') %>% 
    spread(var, distance)

##    item1 item2 names        v1        v2
## 1      2     1     1 0.1111111 0.0000000
## 2      3     1     1 0.0000000 0.1111111
## 3      3     2     1 0.1111111 0.1111111
## 4      4     1     1 0.1111111 0.1111111
## 5      4     2     1 0.0000000 0.1111111
## 6      4     3     1 0.1111111 0.0000000
## 7      5     1     1 0.1111111 0.1111111
## 8      5     2     1 0.0000000 0.1111111
## 9      5     3     1 0.1111111 0.0000000
## 10     5     4     0 0.0000000 0.0000000
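If the goal is to reproduce the wide output from the question (row_a, row_b and one "_j" column per variable) with Jaro-Winkler on every column, a possible sketch is to build the pair indices once and let stringdist work on whole vectors. This is an illustrative alternative under the assumption that only the stringdist package is available; it is not the pipeline shown above, and the names idx, dists and output are chosen for this example.

```r
library(stringdist)

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 v1 = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                 v2 = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
                 stringsAsFactors = FALSE)
testCols <- c("names", "v1", "v2")

# All row pairs as a 2 x choose(n, 2) index matrix.
idx <- combn(nrow(df), 2)

# stringdist() is vectorized over its first two arguments, so each column
# needs a single call instead of one call per pair of rows.
dists <- lapply(df[testCols], function(col) {
  stringdist(col[idx[1, ]], col[idx[2, ]], method = "jw")
})

# Assemble the result once instead of growing it inside a loop.
output <- data.frame(row_a = idx[1, ], row_b = idx[2, ], dists)
names(output)[-(1:2)] <- paste0(testCols, "_j")
output
```

Both stringdist() and stringdistmatrix() also accept an nthread argument that controls how many threads are used, which is the simplest place to start when looking at parallelization.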

Note that while choose(5, 2) is only 10 pairs, choose(8000000, 2) is about 3.2e+13 (32 trillion) pairs. This approach will run much more quickly than your existing code (and stringdistmatrix does some parallelization when possible), but for practical purposes the results will get prohibitively big unless you only work on subsets.
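The size argument can be checked with a few lines of arithmetic, and "working on subsets" can be sketched as blocking: only rows that share some key are compared. The sketch below assumes the stringdist package is installed. The helper compare_within_blocks and the choice of the first character of the name as the key are hypothetical and purely illustrative; any pair whose rows fall into different blocks is never compared, so the key has to suit the data.

```r
library(stringdist)

# Size of the full problem: one double (8 bytes) per pair, per compared column.
n <- 8e6
n_pairs <- choose(n, 2)
tb_per_column <- n_pairs * 8 / 1e12   # terabytes needed for a single column
tb_per_column

# Hypothetical helper: compare only rows that share the same blocking key.
compare_within_blocks <- function(x, key, method = "jw") {
  blocks <- split(seq_along(x), key)          # row indices per block
  res <- lapply(blocks, function(rows) {
    if (length(rows) < 2) return(NULL)        # a single row has no pairs
    idx <- combn(rows, 2)
    data.frame(row_a = idx[1, ],
               row_b = idx[2, ],
               dist = stringdist(x[idx[1, ]], x[idx[2, ]], method = method))
  })
  res <- Filter(Negate(is.null), res)
  do.call(rbind, unname(res))
}

names_vec <- c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND")

# Assumed blocking key for illustration: the first character of each name.
blocked <- compare_within_blocks(names_vec, substr(names_vec, 1, 1))
blocked
```

With this key, the five example rows yield two pairs (rows 1 and 3, rows 4 and 5) instead of ten, and the full-size estimate comes out at roughly 256 TB for one column of distances.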



Source: https://stackoverflow.com/questions/42321383/in-r-fastest-way-pairwise-comparing-character-strings-on-similarity
