R filtering out a subset


误落风尘

I have a data.frame A and a data.frame B, which contains a subset of the rows of A.

How can I create a data.frame C which is data.frame A with data.frame B excluded? Thanks for your h


6 Answers

眼角桃花

2021-01-29 14:10

If B is truly a subset of A, which you can check with:

```
if (!identical(A[rownames(B), , drop = FALSE], B)) stop("B is not a subset of A!")
```

then you can filter by row names:

```
C <- A[!rownames(A) %in% rownames(B), , drop = FALSE]
```

or, equivalently:

```
C <- A[setdiff(rownames(A), rownames(B)), , drop = FALSE]
```
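As a minimal sketch of the row-name filter above: the toy data and the choice of rows are illustrative and not from the original question, and the sketch assumes B was taken from A by plain row indexing, so that it keeps A's row names.

```r
# Toy data: A has the default row names "1".."6"
A <- data.frame(id = 1:6, value = c(2.5, 3.1, 4.7, 5.0, 6.2, 7.9))

# B keeps the row names "2" and "5" of the rows it was taken from
B <- A[c(2, 5), , drop = FALSE]

# Check that B really is a subset of A before filtering
stopifnot(identical(A[rownames(B), , drop = FALSE], B))

# Two ways to drop B's rows from A, both keyed on row names
C1 <- A[!rownames(A) %in% rownames(B), , drop = FALSE]
C2 <- A[setdiff(rownames(A), rownames(B)), , drop = FALSE]

print(rownames(C1))
print(rownames(C2))
```

Both forms compare row names as character strings, so they should also work when the row names are labels rather than numbers.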


-上瘾入骨i

2021-01-29 14:19

Here are two data.table solutions that are memory- and time-efficient:

```
library(data.table)
# some biggish data
set.seed(1234)
ADT <- data.table(x = seq.int(1e+07), y = seq.int(1e+07))

.rows <- sample(nrow(ADT), 30000)
# Random subset of A in B
BDT <- ADT[.rows, ]

# set keys for fast merge
setkey(ADT, x)
setkey(BDT, x)

# Copies of the data as plain data.frames, for the base R alternative
A <- copy(ADT)
setattr(A, "class", "data.frame")
B <- copy(BDT)
setattr(B, "class", "data.frame")

# DT: look up the matching row numbers with a keyed join, then drop them
f2 <- function() noBDT <- ADT[-ADT[BDT, which = TRUE]]
# DT2: use x directly as the row number (valid here because x is 1..nrow)
f3 <- function() noBDT2 <- ADT[-BDT[, x]]
# base: negative indexing on a data.frame by row name
# (note: B's row names appear to be reset to 1..nrow(B) by the conversion, so
# this times the indexing step; A[-B$x, ] would drop the matching rows)
f1 <- function() noB <- A[-as.integer(rownames(B)), ]

library(rbenchmark)
benchmark(base = f1(), DT = f2(), DT2 = f3(), replications = 3)

##   test replications elapsed relative user.self sys.self
## 2   DT            3    0.92    1.108      0.77     0.15
## 1 base            3    3.72    4.482      3.19     0.52
## 3  DT2            3    0.83    1.000      0.72     0.11
```
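To make the keyed-join step in `f2()` easier to follow, here is a small sketch on a six-row table. The toy data is an assumption for illustration. The last line uses data.table's not-join form `ADT[!BDT]`, which is not part of the answer above but gives the same rows here.

```r
library(data.table)

# Toy data: x doubles as the row number, as in the benchmark above
ADT <- data.table(x = 1:6, y = letters[1:6])
BDT <- ADT[c(2, 5)]

# Key both tables on x so that ADT[BDT] joins on x
setkey(ADT, x)
setkey(BDT, x)

# which = TRUE returns the row numbers of ADT that match BDT
idx <- ADT[BDT, which = TRUE]

# Drop those rows by negative indexing, as f2() does
CDT <- ADT[-idx]

# Not-join on the key: rows of ADT with no match in BDT
CDT2 <- ADT[!BDT]

print(CDT)
```

The negative-index form assumes `idx` is non-empty; the not-join form does not build the index at all.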


长情又很酷

2021-01-29 14:23
Get the rows of A that are not in B:
```
C = A[! data.frame(t(A)) %in% data.frame(t(B)), ]
```

旧巷少年郎

2021-01-29 14:29

This is not the fastest approach and is likely to be very slow, but it is an alternative to mplourde's that compares the row contents and should work on mixed data, which flodel critiqued. It relies on the paste2 function from the qdap package, which has not been released yet; I plan to release it within the next month or two.

The paste2 function:

```
paste2 <- function(multi.columns, sep = ".", handle.na = TRUE, trim = TRUE) {
    # Strip leading and trailing whitespace from every column
    if (trim) {
        multi.columns <- lapply(multi.columns, function(x) {
            gsub("^\\s+|\\s+$", "", x)
        })
    }

    # lapply() returns a plain list, so bind it back into a matrix
    if (!is.data.frame(multi.columns) && is.list(multi.columns)) {
        multi.columns <- do.call("cbind", multi.columns)
    }

    m <- if (handle.na) {
        # A row containing any NA yields NA rather than a pasted string
        apply(multi.columns, 1, function(x) {
            if (any(is.na(x))) NA else paste(x, collapse = sep)
        })
    } else {
        apply(multi.columns, 1, paste, collapse = sep)
    }
    names(m) <- NULL
    return(m)
}

# Flodel's mixed data set:
A <- data.frame(x = 1:4, y = as.character(1:4)); B <- A[1:2, ]

# My approach:
A[!paste2(A) %in% paste2(B), ]
```
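The same idea, one string key per row compared with `%in%`, can be sketched with base R only. This is a simplified stand-in and not the qdap implementation: the helper name `row_key` and the `"\r"` separator are assumptions made for illustration, and unlike `paste2` it does no trimming and treats `NA` as the string `"NA"`.

```r
# Flodel's mixed data set, as in the answer above
A <- data.frame(x = 1:4, y = as.character(1:4))
B <- A[1:2, ]

# Hypothetical helper: paste the columns of each row into one string.
# "\r" is an arbitrary separator assumed not to occur in the data;
# unname() keeps column names from being read as paste() arguments.
row_key <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\r"))
}

# Keep the rows of A whose key does not appear among B's keys
C <- A[!row_key(A) %in% row_key(B), ]
print(C)
```

Because `paste()` is vectorised over the columns, this avoids the row-by-row `apply()` call, at the cost of the NA handling that `paste2` provides.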


礼貌的吻别

2021-01-29 14:31

```
A <- data.frame(x = 1:10, y = 1:10)
# Random subset of A in B
B <- A[sample(nrow(A), 3), ]
# Get the rows of A that are not in B
C <- A[-as.integer(rownames(B)), ]
```

Performance test vis-a-vis mplourde's answer:

```
library(rbenchmark)
f1 <- function() A[-as.integer(rownames(B)), ]
f2 <- function() A[!data.frame(t(A)) %in% data.frame(t(B)), ]
benchmark(f1(), f2(), replications = 10000,
          columns = c("test", "elapsed", "relative"),
          order = "elapsed")
```

```
  test elapsed relative
1 f1()   1.531   1.0000
2 f2()   8.846   5.7779
```

In this benchmark, indexing by row names is about 5.8 times faster (1.531 s versus 8.846 s over 10,000 replications). The two calls to transpose in the other approach are computationally expensive.
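The row-name shortcut relies on the row names of B being row positions in A. The sketch below, on made-up data, shows one case where that holds and one where it does not: when A is itself a slice of a larger data frame, its row names no longer match its row positions, and the negative index drops nothing.

```r
full <- data.frame(x = 1:10, y = 1:10)

# Case 1: default row names, so row names equal row positions
A <- full
B <- A[c(2, 7), ]
C <- A[-as.integer(rownames(B)), ]

# Case 2: A2 keeps row names "6".."10" from `full`, but its positions are 1..5
A2 <- full[6:10, ]
B2 <- A2[c(1, 3), ]  # row names "6" and "8"

# -6 and -8 are beyond A2's five rows, so no row is removed
wrong <- A2[-as.integer(rownames(B2)), ]

# Matching on the row names themselves still removes B2's rows
right <- A2[!rownames(A2) %in% rownames(B2), ]

print(c(nrow(C), nrow(wrong), nrow(right)))
```

If A may carry non-default row names, the `%in%` form is the safer choice, at some cost in speed.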


温柔的废话

2021-01-29 14:33
If B really is a subset of A, some indexing operation must have created it in the first place. IMHO, rather than computing the difference between the two data sets, we should negate the original index that produced B. Here's an example of what I mean:
```
A <- mtcars
B <- mtcars[mtcars$cyl==6, ]
C <- mtcars[mtcars$cyl!=6, ]
```
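One way to apply this in practice is to store the index once and reuse it for both pieces. This is a sketch under the assumption that the condition contains no `NA` values, which holds for `mtcars$cyl`; the variable name `in_B` is illustrative.

```r
A <- mtcars

# The logical index that defines B, computed once
in_B <- A$cyl == 6

B <- A[in_B, ]   # rows that satisfy the condition
C <- A[!in_B, ]  # complement: negate the same index

# B and C together account for every row of A
stopifnot(nrow(B) + nrow(C) == nrow(A))
print(c(nrow(B), nrow(C)))
```

Keeping the index in a variable means B and C cannot drift apart if the condition is edited later.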