R filtering out a subset

误落风尘 2021-01-29 13:30

I have a data.frame A and a data.frame B which contains a subset of the rows of A.

How can I create a data.frame C which is data.frame A with data.frame B excluded? Thanks for your h

6 Answers
  • 2021-01-29 14:10

    If B is truly a subset of A, which you can check with:

    if(!identical(A[rownames(B), , drop = FALSE], B)) stop("B is not a subset of A!")
    

    then you can filter by rownames:

    C <- A[!rownames(A) %in% rownames(B), , drop = FALSE]
    

    or

    C <- A[setdiff(rownames(A), rownames(B)), , drop = FALSE]
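
    As a small worked illustration of the check and the filter together, here is a sketch on a toy data.frame. The data and the row names r1 to r5 are made up for this example; the approach assumes B still carries the row names it had in A.

    ```r
    # Toy data with explicit (non-integer) row names; values are illustrative only
    A <- data.frame(x = 1:5, y = letters[1:5],
                    row.names = c("r1", "r2", "r3", "r4", "r5"))
    B <- A[c("r2", "r4"), , drop = FALSE]

    # Same sanity check as above: B must be recoverable from A by row name
    if (!identical(A[rownames(B), , drop = FALSE], B)) stop("B is not a subset of A!")

    # Keep every row of A whose row name does not occur in B
    C <- A[!rownames(A) %in% rownames(B), , drop = FALSE]
    print(C)
    ```

    Because the match is on row names rather than integer positions, this also works when the row names are not numbers.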
    
  • 2021-01-29 14:19

    Here are two data.table solutions that are both memory- and time-efficient:

    library(data.table)
    # some biggish data
    set.seed(1234)
    ADT <- data.table(x = seq.int(1e+07), y = seq.int(1e+07))
    
    .rows <- sample(nrow(ADT), 30000)
    # Random subset of A in B
    BDT <- ADT[.rows, ]
    
    # set keys for fast merge
    setkey(ADT, x)
    setkey(BDT, x)
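    ## keep copies of the data as data.frames for the base R
    ## alternative
    A <- copy(ADT)
    setattr(A, "class", "data.frame")
    B <- copy(BDT)
    setattr(B, "class", "data.frame")
    # DT: look up the row numbers of BDT's keys in ADT, then drop those rows
    f2 <- function() noBDT <- ADT[-ADT[BDT, which = TRUE]]
    # DT2: drop rows by position; valid here only because x equals the row number
    f3 <- function() noBDT2 <- ADT[-BDT[, x]]
    # base: drop rows by position taken from B's row names
    # (B came from a data.table, so its row names are 1..nrow(B), not positions
    # in A; treat this as a timing baseline only)
    f1 <- function() noB <- A[-as.integer(rownames(B)), ]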
    
    library(rbenchmark)
    benchmark(base = f1(),DT = f2(), DT2 = f3(), replications = 3)
    
    ##   test replications elapsed relative user.self sys.self
    ## 2   DT            3    0.92    1.108      0.77     0.15
    ## 1 base            3    3.72    4.482      3.19     0.52
    ## 3  DT2            3    0.83    1.000      0.72     0.11
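
    For readers who only need the result rather than the benchmark, data.table also supports a "not join" written as `X[!Y, on = ...]`. The sketch below uses a tiny made-up table; it assumes column x uniquely identifies a row, and unlike the positional variant it does not require x to equal the row number.

    ```r
    library(data.table)

    # Small illustrative data; the values are assumptions for this sketch
    ADT <- data.table(x = 1:10, y = (1:10)^2)
    BDT <- ADT[c(3, 7, 9)]

    # Not-join (anti-join): rows of ADT whose x has no match in BDT
    CDT <- ADT[!BDT, on = "x"]
    print(CDT)
    ```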
    
  • 2021-01-29 14:23

    Get the rows in A that aren't in B:

    C = A[! data.frame(t(A)) %in% data.frame(t(B)), ]
    
  • 2021-01-29 14:29

    This is not the fastest approach and is likely to be very slow, but it is an alternative to mplourde's that takes the row data into account and should work on the mixed data that flodel critiqued. It relies on the paste2 function from the qdap package, which has not been released yet; I plan to release it within the next month or two:

    The paste2 function:

    paste2 <- function(multi.columns, sep = ".", handle.na = TRUE, trim = TRUE) {

        # Optionally strip leading/trailing whitespace from every column
        if (trim) {
            multi.columns <- lapply(multi.columns, function(x) {
                gsub("^\\s+|\\s+$", "", x)
            })
        }

        # A plain list (e.g. the lapply result) is bound into a character matrix
        if (!is.data.frame(multi.columns) && is.list(multi.columns)) {
            multi.columns <- do.call("cbind", multi.columns)
        }

        # Paste each row into one key; with handle.na, a row containing NA gives NA
        m <- if (handle.na) {
            apply(multi.columns, 1, function(x) {
                if (any(is.na(x))) NA else paste(x, collapse = sep)
            })
        } else {
            apply(multi.columns, 1, paste, collapse = sep)
        }
        names(m) <- NULL
        return(m)
    }
    

    # Flodel's mixed data set:

    A <- data.frame(x = 1:4, y = as.character(1:4)); B <- A[1:2, ]
    

    # My approach:

    A[!paste2(A)%in%paste2(B), ]
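
    For comparison, a vectorised variant of the same row-key idea can be sketched in base R alone. The helper name row_key and the "\r" separator are assumptions made for this illustration and are not part of qdap; unlike paste2, a row containing NA gets a key with the literal string "NA" rather than NA, and the separator must not occur in the data.

    ```r
    # Hypothetical helper: build one string key per row by pasting all columns
    row_key <- function(df) {
        # unname() avoids a clash if a column happens to be called "sep" or "collapse"
        do.call(paste, c(unname(as.list(df)), sep = "\r"))
    }

    # Flodel's mixed data set from above
    A <- data.frame(x = 1:4, y = as.character(1:4))
    B <- A[1:2, ]

    # Keep the rows of A whose key does not appear among B's keys
    C <- A[!row_key(A) %in% row_key(B), , drop = FALSE]
    print(C)
    ```

    This compares rows by value, so it does not depend on B having kept A's row names.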
    
  • 2021-01-29 14:31
    A <- data.frame(x = 1:10, y = 1:10)
    #Random subset of A in B
    B <- A[sample(nrow(A),3),]
    #get A that is not in B
    C <- A[-as.integer(rownames(B)),]
    

    Performance test vis-a-vis mplourde's answer:

    library(rbenchmark)
    f1 <- function() A[- as.integer(rownames(B)),]
    f2 <- function() A[! data.frame(t(A)) %in% data.frame(t(B)), ]
    benchmark(f1(), f2(), replications = 10000, 
              columns = c("test", "elapsed", "relative"),
              order = "elapsed"
              )
    
      test elapsed relative
    1 f1()   1.531   1.0000
    2 f2()   8.846   5.7779
    

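    In this benchmark, indexing by row names is approximately 6x faster (relative 5.78). Two calls to transpose can get computationally expensive. Note that as.integer(rownames(B)) assumes B still carries A's default integer row names.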

  • 2021-01-29 14:33

    If this B data set is truly a nested version of the first data set there has to be indexing that created this data set to begin with. IMHO we shouldn't be discussing the differences between the data sets but negating the original indexing that created the B data set to begin with. Here's an example of what I mean:

    A <- mtcars
    B <- mtcars[mtcars$cyl==6, ]
    C <- mtcars[mtcars$cyl!=6, ]
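
    One way to make that idea explicit is to store the index once and negate it, so B and C are guaranteed to be complementary. This is a minimal sketch; the variable name in_B is an assumption, and it presumes the condition contains no NA values (true for mtcars$cyl).

    ```r
    A <- mtcars

    # Logical index that defines B; stored so it can be negated for C
    in_B <- A$cyl == 6

    B <- A[in_B, ]
    C <- A[!in_B, ]

    # Every row of A lands in exactly one of B or C
    stopifnot(nrow(B) + nrow(C) == nrow(A))
    ```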
    