Question
I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?
Say I have the following data.frame:
df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                 v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))
I want to compare each pair of rows in df on their Jaro-Winkler similarity.
With some help from others (see this post), I've been able to construct this code:
library(stringdist)

#columns to compare
testCols <- c("names", "v1", "v2")

#compare pairs
RowCompare <- function(x){
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))
  for(i in 1:nrow(pairs)){
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}
#define TestsCompare
TestsCompare <- function(x, y){
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}
This generates the correct output:
output <- as.data.frame(RowCompare(df))
> output
row_a row_b names_j v1_j v2_j
1 1 2 0.4444444 0.1111111 0.0000000
2 1 3 0.3571429 0.0000000 0.1111111
3 1 4 0.4444444 0.1111111 0.1111111
4 1 5 0.4444444 0.1111111 0.1111111
5 2 3 0.4603175 0.1111111 0.1111111
6 2 4 0.3333333 0.0000000 0.1111111
7 2 5 0.3333333 0.0000000 0.1111111
8 3 4 0.5634921 0.1111111 0.0000000
9 3 5 0.5634921 0.1111111 0.0000000
10 4 5 0.0000000 0.0000000 0.0000000
However, my real data.frame has 8 million observations, and I make 17 comparisons. Running this code takes days...
I am looking for ways to speed up this process:
- Should I use matrices instead of data.frames?
- How can I parallelize this process?
- Can it be vectorized? (See the sketch just below.)
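For what it's worth, I know stringdist() is vectorized over paired character vectors, so I imagine the inner loop could collapse into one call per column. A rough sketch of what I mean (same pairs indices as in RowCompare; untested on the full data):

#sketch of the vectorization idea, assuming library(stringdist) is loaded:
#one vectorized stringdist() call per column replaces the row-by-row loop
pairs <- t(combn(nrow(df), 2))
a <- pairs[, 1]
b <- pairs[, 2]
output_vec <- data.frame(
  row_a   = a,
  row_b   = b,
  names_j = stringdist(as.character(df$names)[a], as.character(df$names)[b], method = "jw"),
  v1_j    = stringdist(as.character(df$v1)[a], as.character(df$v1)[b], method = "jw"),
  v2_j    = stringdist(as.character(df$v2)[a], as.character(df$v2)[b], method = "jw")
)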
Answer 1:
If you iterate over the variables you want to check, you can make a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one for each column), which you can in turn iterate over to call broom::tidy, which will turn them into nicely formatted data.frames. If you use purrr::map_df and its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can keep them straight. The resulting data.frame will be in long form, so if you want it to match the results above, reshape with tidyr::spread.
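For a single column, the first two steps look like this (a minimal sketch: given one vector, stringdistmatrix() returns a "dist" object, and broom::tidy() reshapes a dist into item1/item2/distance rows):

library(stringdist)

#lower-triangle pairwise distances for one column, returned as a "dist" object
d_names <- stringdistmatrix(df$names, method = "jw")

#long form: one row per pair of rows, with columns item1, item2, distance
broom::tidy(d_names)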
If, as you mentioned in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.
Altogether,
library(tidyverse)

map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>%
    map_df(broom::tidy, .id = 'var') %>%
    spread(var, distance)
## item1 item2 names v1 v2
## 1 2 1 1 0.1111111 0.0000000
## 2 3 1 1 0.0000000 0.1111111
## 3 3 2 1 0.1111111 0.1111111
## 4 4 1 1 0.1111111 0.1111111
## 5 4 2 1 0.0000000 0.1111111
## 6 4 3 1 0.1111111 0.0000000
## 7 5 1 1 0.1111111 0.1111111
## 8 5 2 1 0.0000000 0.1111111
## 9 5 3 1 0.1111111 0.0000000
## 10 5 4 0 0.0000000 0.0000000
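As an aside, tidyr::spread() has since been superseded; with tidyr >= 1.0 the same pipeline can end in pivot_wider() instead:

map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>%
    map_df(broom::tidy, .id = 'var') %>%
    tidyr::pivot_wider(names_from = var, values_from = distance)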
Note that while choose(5, 2) returns 10 observations, choose(8000000, 2) returns 3.2e+13 (32 trillion) observations, so for practical purposes, even though this will work much more quickly than your existing code (and stringdistmatrix does some parallelization when possible), the data will get prohibitively big unless you are only working on subsets.
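If subsets are viable, one common pattern is to split the data on a cheap blocking key and only compare rows within each block. A sketch (the key here, the first character of names, is an arbitrary illustration; substitute whatever key makes sense for your data, and note that item1/item2 index rows within each block, not the original data.frame):

#split on a blocking key so pairs are only formed within blocks
blocks <- split(df, substr(as.character(df$names), 1, 1))

#compute within-block distances and stack the results, tagged by block
results <- purrr::map_df(blocks, function(b) {
    if (nrow(b) < 2) return(NULL)  #singleton blocks have no pairs
    broom::tidy(stringdist::stringdistmatrix(b$names, method = "jw"))
}, .id = "block")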
Source: https://stackoverflow.com/questions/42321383/in-r-fastest-way-pairwise-comparing-character-strings-on-similarity