R, find duplicated rows, regardless of order

Question

I've been thinking about this problem all night. Here is my matrix:

'a' '#' 3
'#' 'a' 3
 0  'I am' 2
'I am' 0 2

.....

I want to treat rows such as the first two as the same, because they differ only in the order of 'a' and '#'. In my case, I want to delete such rows. The toy example is simple: the first two rows are the same, and the third and fourth are the same. But in my real data set, I don't know where the 'same' rows are.

I'm writing in R. Thanks.

Answer 1:

Perhaps something like this would work for you. It is not clear what your desired output is though.

x <- structure(c("a", "#", "0", "I am", "#", "a", "I am", "0", "3", 
                 "3", "2", "2"), .Dim = c(4L, 3L))
x
#      [,1]   [,2]   [,3]
# [1,] "a"    "#"    "3" 
# [2,] "#"    "a"    "3" 
# [3,] "0"    "I am" "2" 
# [4,] "I am" "0"    "2" 


duplicated(
  lapply(1:nrow(x), function(y){
    A <- x[y, ]
    A[order(A)]
  }))
# [1] FALSE  TRUE FALSE  TRUE

This basically splits the matrix up by row, then sorts each row. duplicated() works on lists too, so you just wrap the whole thing in duplicated() to find which items (rows) are duplicated.
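Since the question asks about deleting such rows, here is a minimal sketch of how the logical vector could be used for that. The names sorted_rows, dup, x_unique and x_none are made up for illustration, and which variant is wanted depends on the desired output, which the question leaves open.

```r
# Rebuild the example matrix from this answer.
x <- structure(c("a", "#", "0", "I am", "#", "a", "I am", "0", "3",
                 "3", "2", "2"), .Dim = c(4L, 3L))

# Sort each row, then flag rows whose sorted contents were already seen.
sorted_rows <- lapply(seq_len(nrow(x)), function(i) sort(x[i, ]))
dup <- duplicated(sorted_rows)

# Variant 1: keep the first row of each group of "same" rows.
# drop = FALSE keeps the result a matrix even if only one row survives.
x_unique <- x[!dup, , drop = FALSE]

# Variant 2: remove every row that has a match anywhere in the matrix,
# by also scanning from the end with fromLast = TRUE.
dup_any <- dup | duplicated(sorted_rows, fromLast = TRUE)
x_none <- x[!dup_any, , drop = FALSE]
```

With the toy data, variant 1 keeps rows 1 and 3, while variant 2 leaves no rows at all, because every row has a partner.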

Answer 2:

For me, this also produced just a vector of FALSE, meaning that it detected no duplicates. I think this is what happened: I had column names assigned in x. A[order(A)] does sort each row, but every value keeps its column name, so lapply hands duplicated() rows whose names still record the original column order. Two rows with the same values in a different order therefore end up with differently ordered names, and duplicated() treats them as different, just as it would for the rows of x itself.

Inspired by the answer of @A Handcart And Mohair, I did this instead, which worked for me:

duplicated(t(apply(x, 1, sort)))

It is also shorter ;)

Note that the example by @A Handcart And Mohair works with the sample data in that answer, but it fails if your matrix has named columns.
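To make the difference concrete, here is a small sketch that puts column names on the sample matrix and runs both approaches. The column names V1 to V3 are hypothetical, and stripping names with unname() is my own assumption about a possible fix for the list approach, not something taken from the other answer.

```r
x <- structure(c("a", "#", "0", "I am", "#", "a", "I am", "0", "3",
                 "3", "2", "2"), .Dim = c(4L, 3L))
colnames(x) <- c("V1", "V2", "V3")  # hypothetical column names

# List approach: each sorted row carries its names along, and the names
# end up in a different order for rows that hold the same values.
named_rows <- lapply(seq_len(nrow(x)), function(i) {
  A <- x[i, ]
  A[order(A)]
})
dup_named <- duplicated(named_rows)   # no duplicates detected

# Same approach with the names stripped before sorting.
plain_rows <- lapply(seq_len(nrow(x)), function(i) sort(unname(x[i, ])))
dup_plain <- duplicated(plain_rows)   # rows 2 and 4 flagged

# Matrix approach from this answer: only the values are compared.
dup_apply <- duplicated(t(apply(x, 1, sort)))   # rows 2 and 4 flagged
```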

Answer 3:

As a start, you might want to refer to the documentation for duplicated(), which is a base R function rather than a separate package. As the documentation notes, "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates." Some examples that it provides are:

Example 1:

duplicated(iris)[140:143]

Example 2:

duplicated(iris3, MARGIN = c(1, 3))

Example 3:

anyDuplicated(iris)

Example 4:

anyDuplicated(x)

Example 5:

anyDuplicated(x, fromLast = TRUE)
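As a rough illustration of how these functions behave on the matrix from the question (rebuilt here as x, with every entry stored as a character string), the sketch below compares a plain call with one made after sorting each row. The variable names are made up for the example.

```r
x <- structure(c("a", "#", "0", "I am", "#", "a", "I am", "0", "3",
                 "3", "2", "2"), .Dim = c(4L, 3L))

# Plain calls compare rows position by position, so reordered rows
# are not reported as duplicates.
plain_dup <- duplicated(x)        # all FALSE
plain_any <- anyDuplicated(x)     # 0, meaning no duplicate found

# After sorting within each row, the reordered rows become identical.
sorted <- t(apply(x, 1, sort))
sorted_any <- anyDuplicated(sorted)   # 2, the index of the first duplicate row
```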

EDIT: If you wanted to do it the long way, you might think of comparing every row to every other row in the data frame, character by character. To do this, imagine that the first row has 3 characters. For each other row, you loop through and check whether it contains the first of these characters. If it does, you remove that character from consideration and check the next one. A self-written recursive function that compares each value in a row against all other rows in the data frame or matrix (and then subsets ONLY on rows that do not match any other row) could work.
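A simplified sketch of this pairwise idea is shown below. It swaps the recursive, character-by-character check for a comparison of sorted rows inside two nested loops, so it is not the exact procedure described above. The helper name has_no_match and the extra fifth row are invented for illustration.

```r
# Hypothetical helper: TRUE for rows that match no other row once the
# order of values within a row is ignored.
has_no_match <- function(m) {
  m <- unname(m)                 # ignore dimnames when comparing
  n <- nrow(m)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Two rows match when their sorted values are identical.
      if (i != j && identical(sort(m[i, ]), sort(m[j, ]))) {
        keep[i] <- FALSE
        break
      }
    }
  }
  keep
}

# The question's rows plus one invented row that has no partner.
x <- rbind(c("a", "#", "3"),
           c("#", "a", "3"),
           c("0", "I am", "2"),
           c("I am", "0", "2"),
           c("b", "c", "1"))

keep <- has_no_match(x)
x_kept <- x[keep, , drop = FALSE]   # only the fifth row remains
```

This does on the order of n^2 row comparisons, so for large data the sort-then-duplicated() approaches are likely the better choice.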

Source: https://stackoverflow.com/questions/22980423/r-find-duplicated-rows-regardless-of-order

Tags

duplicate-data