R: find duplicated rows, regardless of order

遥遥无期 2020-12-04 01:04

I've been thinking about this problem for a whole night. Here is my matrix:

'a' '#' 3
'#' 'a' 3
 0  'I am' 2
'I am' 0 2

.....

3 answers
  • 2020-12-04 01:13

    Perhaps something like this would work for you. It is not clear what your desired output is though.

    x <- structure(c("a", "#", "0", "I am", "#", "a", "I am", "0", "3", 
                     "3", "2", "2"), .Dim = c(4L, 3L))
    x
    #      [,1]   [,2]   [,3]
    # [1,] "a"    "#"    "3" 
    # [2,] "#"    "a"    "3" 
    # [3,] "0"    "I am" "2" 
    # [4,] "I am" "0"    "2" 
    
    
    duplicated(
      lapply(1:nrow(x), function(y){
        A <- x[y, ]
        A[order(A)]
      }))
    # [1] FALSE  TRUE FALSE  TRUE
    

    This basically splits the matrix up by row, then sorts each row. duplicated() works on lists too, so you just wrap the whole thing in duplicated() to find which items (rows) are duplicated.
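
    If the goal is to drop the repeated rows (an assumption, since the question does not state the desired output), the logical vector can be used as a row index. A minimal sketch on the same sample data; the names sorted_rows, dup and x_unique are only illustrative:

```r
# Same sample matrix as above, without dimnames.
x <- matrix(c("a", "#", "0", "I am",
              "#", "a", "I am", "0",
              "3", "3", "2", "2"), nrow = 4)

# Sort the values within each row so that column order no longer matters.
sorted_rows <- lapply(seq_len(nrow(x)), function(i) sort(x[i, ]))

# TRUE marks a row whose sorted values already appeared in an earlier row.
dup <- duplicated(sorted_rows)

# Keep only the first occurrence of each set of values;
# drop = FALSE keeps the result a matrix even if one row remains.
x_unique <- x[!dup, , drop = FALSE]
x_unique
```

    Using `which(dup)` instead would give the row numbers of the repeats rather than removing them.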

  • 2020-12-04 01:26

    For me, this also produced just a vector of FALSE, meaning that it detected no duplicates. I think this is what happened: I had column names assigned in x. Indexing with order(A) sorts the values of each row, but each value carries its column name along, so two rows holding the same values end up with their names in different orders. duplicated() on a list appears to compare attributes as well as values, so it treats those sorted rows as different and reports no duplicates.

    Inspired by the answer of @A Handcart And Mohair, I used this instead, which worked for me:

    duplicated(t(apply(x, 1, sort)))
    

    It is also shorter ;)

    Note that the example by @A Handcart And Mohair works with his sample data. But if you have named columns, it fails.
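
    To make the difference concrete, here is a small sketch that reproduces the behaviour with named columns. The column names V1 to V3 are made up for illustration, and stripping the names with unname() is one possible workaround, not something from the original answers:

```r
# Sample matrix with (hypothetical) column names.
x <- matrix(c("a", "#", "0", "I am",
              "#", "a", "I am", "0",
              "3", "3", "2", "2"),
            nrow = 4, dimnames = list(NULL, c("V1", "V2", "V3")))

# List approach: each sorted row keeps its names, reordered with the values,
# so rows with equal values are no longer identical.
dup_named <- duplicated(lapply(seq_len(nrow(x)), function(i) sort(x[i, ])))

# Same approach after dropping the names from each row.
dup_unnamed <- duplicated(lapply(seq_len(nrow(x)),
                                 function(i) sort(unname(x[i, ]))))

# Matrix approach: apply() returns one sorted row per column, so transpose
# back before calling duplicated(), which then compares rows of values.
dup_matrix <- duplicated(t(apply(x, 1, sort)))

dup_named
dup_unnamed
dup_matrix
```

    On this data the first call finds no duplicates, while the other two flag rows 2 and 4.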

  • 2020-12-04 01:31

    As a start, you might want to refer to the documentation for the base R function duplicated() (it is a function, not a package). As the documentation notes, "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates." Some examples that it provides are:

    Example 1:

    duplicated(iris)[140:143]
    

    Example 2:

    duplicated(iris3, MARGIN = c(1, 3))
    

    Example 3:

    anyDuplicated(iris)
    

    Example 4:

    anyDuplicated(x)
    

    Example 5:

    anyDuplicated(x, fromLast = TRUE)
    

    EDIT: If you wanted to do it the long way, you might think of comparing every row to every other row in the data frame, character by character. To do this, imagine that the first row has 3 characters. For each other row, you loop through and check whether it contains the first of these characters. If it does, you drop that character and check the next one. A self-written recursive function that compares each value in a row against all other rows of the data frame or matrix (and then keeps ONLY the rows that do not match any other row) could work.
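
    A minimal sketch of the "long way", written as a plain double loop over row pairs rather than the recursive function described above. It compares whole sorted rows instead of going value by value, and the name is_dup is only illustrative:

```r
# Same sample matrix as in the question.
x <- matrix(c("a", "#", "0", "I am",
              "#", "a", "I am", "0",
              "3", "3", "2", "2"), nrow = 4)

n <- nrow(x)
is_dup <- logical(n)  # all FALSE to start with

# Compare each row with every earlier row.
for (i in seq_len(n)[-1]) {
  for (j in seq_len(i - 1)) {
    # Two rows match if they hold the same values once sorted;
    # unname() guards against column names interfering.
    if (identical(sort(unname(x[i, ])), sort(unname(x[j, ])))) {
      is_dup[i] <- TRUE
      break
    }
  }
}

is_dup
```

    This does on the order of n^2 comparisons, so for large matrices the duplicated() based approaches are likely the better choice.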
