Finding similar rows (not duplicates) in a dataframe in R

Question

I have a dataset of >800k rows (example):

id     fieldA       fieldB              codeA   codeB
120    Similar one  addrs example1      929292  0006
3490   Similar oh   addrs example3      929292  0006
2012   CLOSE CAA    addrs example10232  kkda9a  0039
9058   CLASE CAC    addrs example01232  kkda9a  0039
9058   NON DONE     addrs example010193 kkda9a  0039
48848  OOO AD ADDD  addrs example18238  uyMMnn  8303

The id field is a unique identifier. The fields codeA and codeB must match exactly, while fieldA and fieldB need to be compared with a Levenshtein distance or a similar function. I need to find which rows are very similar on that basis. The output could be something along the lines of:

   codeA    codeB Similar
   929292   0006  120;3490
   kkda9a   0039  2012;9058
   kkda9a   0039  9058
   uyMMnn   8303  48848

A distance matrix for a dataset this big wouldn't work, and wouldn't make much sense given the two constraints on codeA and codeB. I'm guessing one approach would be a plyr function to split by codeA-codeB, but I'm stuck after that.

For clarification, I want to group together all rows that have high similarity in both fieldA and fieldB, and have an exact match in codeA and codeB.
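To put the distance-matrix concern in numbers, here is a rough back-of-the-envelope sketch in base R. It assumes exactly 800,000 rows and 8 bytes per stored distance (both are illustrative assumptions, not measurements), and then counts the pairs that remain for the six example rows once comparisons are restricted to rows sharing codeA and codeB.

```r
# Pairs in a full distance matrix over all rows (n choose 2).
n_total    <- 8e5                              # assumed row count
pairs_full <- n_total * (n_total - 1) / 2      # roughly 3.2e11 pairs
tb_full    <- pairs_full * 8 / 1e12            # roughly 2.56 TB at 8 bytes each

# Pairs when only rows sharing codeA and codeB are compared,
# using the six example rows (group sizes 2, 3 and 1).
codeA <- c("929292", "929292", "kkda9a", "kkda9a", "kkda9a", "uyMMnn")
codeB <- c("0006", "0006", "0039", "0039", "0039", "8303")
group_sizes   <- as.vector(table(paste(codeA, codeB, sep = "-")))
pairs_grouped <- sum(choose(group_sizes, 2))   # 1 + 3 + 0 pairs
pairs_all     <- choose(length(codeA), 2)      # pairs without grouping

print(c(full = pairs_full, terabytes = tb_full))
print(c(grouped = pairs_grouped, ungrouped = pairs_all))
```

The saving on the real data depends on how evenly the rows spread across codeA-codeB combinations; one very large group would still be expensive.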

EDIT:

Following David DeWert's idea, something along these lines seems to work for each codeA-codeB group. The output isn't nice, but it seems a step in the right direction:

library(stringdist)

# Cluster the rows of one codeA-codeB group on the combined text of fieldA and fieldB.
# Returns a named integer vector: names are the row ids, values are cluster numbers.
clustering <- function(x) {
  if (nrow(x) > 1) {
    txt <- paste(x$fieldA, x$fieldB)
    d <- stringdistmatrix(txt, txt, method = "qgram")
    rownames(d) <- x$id
    hc <- hclust(as.dist(d))
    # I still need to evaluate this cut height properly
    res <- cutree(hc, h = 5)
    return(res)
  } else {
    # A single row forms its own cluster
    res <- 1
    names(res) <- x$id
    return(res)
  }
}
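
One hedged way to evaluate the cut height h passed to cutree above is to look at the merge heights for a single group and count how many clusters each candidate h would produce. The sketch below swaps in base R's adist (generalized Levenshtein) for the q-gram distance, purely so that it runs without extra packages. The group is the kkda9a/0039 example, and the candidate heights are arbitrary illustrations.

```r
# One codeA-codeB group from the example data (kkda9a / 0039).
grp <- data.frame(
  id     = c(2012, 9058, 9058),
  fieldA = c("CLOSE CAA", "CLASE CAC", "NON DONE"),
  fieldB = c("addrs example10232", "addrs example01232", "addrs example010193"),
  stringsAsFactors = FALSE
)

txt <- paste(grp$fieldA, grp$fieldB)
d   <- adist(txt)             # base R Levenshtein distance matrix
hc  <- hclust(as.dist(d))     # complete linkage is the hclust default

# Heights at which clusters merge; a large jump suggests a natural cut.
print(hc$height)

# Number of clusters produced by a few candidate cut heights.
candidates <- c(1, 5, 10, 20)
n_clusters <- sapply(candidates, function(h) length(unique(cutree(hc, h = h))))
print(data.frame(h = candidates, clusters = n_clusters))
```

A height that sits inside a wide gap between consecutive merge heights tends to be a more stable choice than one picked blindly, though the right value depends on the distance method and on how long the strings are.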

Now I need to find a way to split the data frame into codeA-codeB groups and apply this function to each of them.

EDIT2:

I managed a "good enough" approach for this using the previous function clustering and the plyr package.

result<-dlply(testDF,.(codeA,codeB),clustering)

This creates a list with one element per codeA-codeB group, like:

$`929292.0006`
 120 3490 
   1    1 

$kkda9a.0039
2012 9058 9058 
   1    1    2 

$uyMMnn.8303
48848 
    1 

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
   codeA codeB
1 929292  0006
2 kkda9a  0039
3 uyMMnn  8303

This effectively clusters by fieldA and fieldB within the groups defined by codeA and codeB. It doesn't produce my desired output, but since I can't find a better solution, it will have to do. My biggest gripe is that the plyr functions won't let me return more than one row per group (which makes complete sense), so I have to use a list as the result instead of a data frame; that's not a real concern. The problem arises when the dataset is quite big (like this one), because plyr doesn't handle large datasets very well, and the alternative dplyr package is not compatible with list results. Oh well.
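
For what it's worth, a list shaped like the one above can be flattened into the table originally asked for using base R alone. This is a minimal sketch: `result` and `labels` are hand-built stand-ins that mirror the printed list and its "split_labels" attribute, and the names `rows` and `out` are made up for illustration.

```r
# Stand-in for the dlply result: one named integer vector per group
# (names are row ids, values are cluster numbers).
result <- list(
  "929292.0006" = c("120" = 1L, "3490" = 1L),
  "kkda9a.0039" = c("2012" = 1L, "9058" = 1L, "9058" = 2L),
  "uyMMnn.8303" = c("48848" = 1L)
)

# Stand-in for attr(result, "split_labels"), in the same order as the list.
labels <- data.frame(
  codeA = c("929292", "kkda9a", "uyMMnn"),
  codeB = c("0006", "0039", "8303"),
  stringsAsFactors = FALSE
)

# For each group, paste together the ids that share a cluster number.
rows <- lapply(seq_along(result), function(i) {
  cl  <- result[[i]]
  ids <- tapply(names(cl), cl, paste, collapse = ";")
  data.frame(
    codeA   = labels$codeA[i],
    codeB   = labels$codeB[i],
    Similar = as.vector(ids),
    stringsAsFactors = FALSE
  )
})

out <- do.call(rbind, rows)
print(out)
```

Reading the codes from the labels table avoids having to split list names such as "929292.0006" on the dot, which would break if a code itself contained a dot.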

Answer 1:

Create a new field called "codeAB" to partition the data according to the codeA-codeB match like so:

data$codeAB <- factor(paste(data$codeA, data$codeB, sep = "-"))

Then cluster the rows within each of levels(data$codeAB) using the Damerau-Levenshtein distance. People seem to be suggesting that ELKI (http://en.wikipedia.org/wiki/ELKI) is good at clustering large collections of data without building a distance matrix.

Someone also asked about using the Damerau-Levenshtein metric in ELKI, in the question "Clustering string data with ELKI".
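
Staying in R, the partition-then-cluster idea can be sketched roughly as follows. This is only an illustration under stated assumptions. Base R's adist (plain Levenshtein) stands in for Damerau-Levenshtein so that no extra package is needed; stringdist's method = "dl" would be the closer match. The helper name `cluster_level` and the cut height of 5 are made up for the example.

```r
# Example rows from the question (id 9058 appears twice there, so it does here too).
data <- data.frame(
  id     = c(120, 3490, 2012, 9058, 9058, 48848),
  fieldA = c("Similar one", "Similar oh", "CLOSE CAA",
             "CLASE CAC", "NON DONE", "OOO AD ADDD"),
  fieldB = c("addrs example1", "addrs example3", "addrs example10232",
             "addrs example01232", "addrs example010193", "addrs example18238"),
  codeA  = c("929292", "929292", "kkda9a", "kkda9a", "kkda9a", "uyMMnn"),
  codeB  = c("0006", "0006", "0039", "0039", "0039", "8303"),
  stringsAsFactors = FALSE
)

# Partition key: rows are only compared when codeA and codeB both match.
data$codeAB <- factor(paste(data$codeA, data$codeB, sep = "-"))

# Cluster one partition; h is an illustrative cut height, not a tuned value.
cluster_level <- function(x, h = 5) {
  if (nrow(x) < 2) return(1L)
  d <- adist(paste(x$fieldA, x$fieldB))   # Levenshtein stand-in for Damerau-Levenshtein
  cutree(hclust(as.dist(d)), h = h)
}

# Write a per-partition cluster number back onto each row, so the result
# stays a flat data frame rather than a list.
data$cluster <- NA_integer_
for (lv in levels(data$codeAB)) {
  idx <- which(data$codeAB == lv)
  data$cluster[idx] <- cluster_level(data[idx, ], h = 5)
}

print(data[, c("id", "codeAB", "cluster")])
```

Cluster numbers restart at 1 inside each partition, so a row's group is identified by the pair (codeAB, cluster). Each distance matrix only covers one partition, which keeps memory use modest as long as no single codeA-codeB combination holds a very large share of the rows.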

I hope that helped.

Source: https://stackoverflow.com/questions/28460051/finding-similar-rows-not-duplicates-in-a-dataframe-in-r

Tags

duplicates

stringdist