Question
I have a dataset of >800k rows (example):
id     fieldA       fieldB               codeA   codeB
120    Similar one  addrs example1       929292  0006
3490   Similar oh   addrs example3       929292  0006
2012   CLOSE CAA    addrs example10232   kkda9a  0039
9058   CLASE CAC    addrs example01232   kkda9a  0039
9058   NON DONE     addrs example010193  kkda9a  0039
48848  OOO AD ADDD  addrs example18238   uyMMnn  8303
The id field is a unique identifier. Rows must match exactly on codeA and codeB, while fieldA and fieldB should be compared with a Levenshtein distance or a similar function. I need to find which rows are very similar on that basis. The output could be something along the lines of:
codeA   codeB  Similar
929292  0006   120;3490
kkda9a  0039   2012;9058
kkda9a  0039   9058
uyMMnn  8303   48848
A distance matrix for a dataset this big wouldn't work, and it wouldn't make much sense anyway given the two constraints on codeA and codeB. I'm guessing one approach would be a plyr function to split by codeA-codeB, but I'm stuck after that.
For clarification, I want to group together all rows that have high similarity in both fieldA and fieldB, and have an exact match in codeA and codeB.
EDIT:
Following David DeWert's idea, something along these lines seems to work for each codeA-codeB group. The output is not pretty, but it seems a step in the right direction:
library(stringdist)
clustering <- function(x) {
  if (nrow(x) > 1) {
    # Pairwise q-gram distances between the concatenated fieldA/fieldB strings
    s <- paste(x$fieldA, x$fieldB)
    d <- stringdistmatrix(s, s, method = "qgram")
    rownames(d) <- x$id
    hc <- hclust(as.dist(d))
    # The cut height still needs to be evaluated properly
    res <- cutree(hc, h = 5)
    # Named vector: names are the ids, values are the cluster numbers
    return(res)
  } else {
    # A single row forms its own cluster
    res <- 1
    names(res) <- x$id
    return(res)
  }
}
Now I need to find a way to split the dataframe in codeA-codeB groups and apply this function to them.
EDIT2:
I managed a "good enough" approach for this using the previous function clustering and the plyr package.
library(plyr)
result <- dlply(testDF, .(codeA, codeB), clustering)
This creates a list with one element per codeA-codeB group, like:
$`929292.0006`
 120 3490
   1    1

$kkda9a.0039
2012 9058 9058
   1    1    2

$uyMMnn.8303
48848
    1

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
   codeA codeB
1 929292  0006
2 kkda9a  0039
3 uyMMnn  8303
This effectively clusters by fieldA and fieldB within the groups defined by codeA and codeB. It doesn't produce my desired output, but since I can't find a better solution, it will have to do. My biggest gripe is that the nature of the plyr functions won't allow me to return more than one row per group (which makes complete sense), so I have to use a list as the result instead of a data frame; not a real concern. The problem arises when the dataset is quite big (like this one), because plyr doesn't cope well with large datasets, and the alternative dplyr package is not compatible with list results. Oh well.
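For reference, here is a minimal base-R sketch of one way to get from per-group clusters to the tabular codeA / codeB / Similar layout described at the top. It is not the approach above: it swaps in `split` and `lapply` for plyr and `adist` (generalized Levenshtein, from base R) for the q-gram distance. The function name `similar_in_group` and the cut height `max_dist = 5` are assumptions made for illustration, and the toy data just mirrors the example rows.

```r
# Toy data mirroring the example rows (the duplicated id 9058 is kept as posted).
df <- data.frame(
  id     = c(120L, 3490L, 2012L, 9058L, 9058L, 48848L),
  fieldA = c("Similar one", "Similar oh", "CLOSE CAA",
             "CLASE CAC", "NON DONE", "OOO AD ADDD"),
  fieldB = c("addrs example1", "addrs example3", "addrs example10232",
             "addrs example01232", "addrs example010193", "addrs example18238"),
  codeA  = c("929292", "929292", "kkda9a", "kkda9a", "kkda9a", "uyMMnn"),
  codeB  = c("0006", "0006", "0039", "0039", "0039", "8303"),
  stringsAsFactors = FALSE
)

# Cluster one codeA-codeB group and return one row per cluster.
# max_dist is a hypothetical cut height; it needs tuning on real data.
similar_in_group <- function(x, max_dist = 5) {
  if (nrow(x) > 1) {
    # Levenshtein distances are computed within this group only
    d  <- adist(paste(x$fieldA, x$fieldB))
    cl <- cutree(hclust(as.dist(d)), h = max_dist)
  } else {
    cl <- 1L  # a single row is its own cluster
  }
  # Collapse the ids of each cluster into one ";"-separated string
  ids <- tapply(x$id, cl, paste, collapse = ";")
  data.frame(codeA = x$codeA[1], codeB = x$codeB[1],
             Similar = as.vector(ids), stringsAsFactors = FALSE)
}

# Split on the exact-match keys, cluster each piece, and stack the results
groups <- split(df, list(df$codeA, df$codeB), drop = TRUE)
out <- do.call(rbind, lapply(groups, similar_in_group))
rownames(out) <- NULL
out
```

Because each distance matrix covers only one codeA-codeB group, the memory cost is driven by the largest group rather than by the full 800k rows. A single very large group would still be a problem.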
Answer 1:
Create a new field called "codeAB" to partition the data according to the codeA-codeB match like so:
data$codeAB <- factor(apply( data[ , c(4,5) ] , 1 , paste , collapse = "-" ))
Then cluster the rows within each of levels(data$codeAB) using the Damerau-Levenshtein distance.
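A rough sketch of what that per-level clustering could look like, assuming the stringdist package is installed (its `method = "dl"` is the full Damerau-Levenshtein distance). The helper name `cluster_level`, the cut height `h = 5`, and the three toy rows are illustrative assumptions, not part of the original suggestion.

```r
library(stringdist)  # assumed to be installed

# Hypothetical toy rows; column names follow the question.
data <- data.frame(
  id     = c(2012L, 9058L, 48848L),
  fieldA = c("CLOSE CAA", "CLASE CAC", "OOO AD ADDD"),
  fieldB = c("addrs example10232", "addrs example01232", "addrs example18238"),
  codeA  = c("kkda9a", "kkda9a", "uyMMnn"),
  codeB  = c("0039", "0039", "8303"),
  stringsAsFactors = FALSE
)

# Partition key built by name rather than by column position
data$codeAB <- factor(paste(data$codeA, data$codeB, sep = "-"))

# Cluster the rows of one level; h is an illustrative cut height
cluster_level <- function(x, h = 5) {
  if (nrow(x) < 2) return(setNames(1L, x$id))
  # With a single argument, stringdistmatrix returns a "dist" object
  d <- stringdistmatrix(paste(x$fieldA, x$fieldB), method = "dl")
  setNames(cutree(hclust(d), h = h), x$id)
}

clusters <- lapply(split(data, data$codeAB), cluster_level)
clusters
```

Each element of `clusters` is a named integer vector (names are ids, values are cluster numbers), so the result has the same shape as the list produced in the question.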
People seem to suggest that ELKI (http://en.wikipedia.org/wiki/ELKI) is good at clustering large collections of data without building a distance matrix.
Someone was also asking about the D-L metric in ELKI: "Clustering string data with ELKI".
I hope that helped.
Source: https://stackoverflow.com/questions/28460051/finding-similar-rows-not-duplicates-in-a-dataframe-in-r