I have a data.frame A and a data.frame B which contains a subset of A
How can I create a data.frame C which is data.frame A with data.frame B excluded? Thanks for your h
If B is truly a subset of A, which you can check with:
if(!identical(A[rownames(B), , drop = FALSE], B)) stop("B is not a subset of A!")
then you can filter by rownames:
C <- A[!rownames(A) %in% rownames(B), , drop = FALSE]
or
C <- A[setdiff(rownames(A), rownames(B)), , drop = FALSE]
Here are two data.table
solutions that will be memory and time efficient
render_markdown(strict = T)
library(data.table)
# some biggish data
set.seed(1234)
ADT <- data.table(x = seq.int(1e+07), y = seq.int(1e+07))
.rows <- sample(nrow(ADT), 30000)
# Random subset of A in B
BDT <- ADT[.rows, ]
# set keys for fast merge
setkey(ADT, x)
setkey(BDT, x)
## how CDT <- ADT[-ADT[BDT,which=T]] the data as `data.frames for fastest
## alternative
A <- copy(ADT)
setattr(A, "class", "data.frame")
B <- copy(BDT)
setattr(B, "class", "data.frame")
f2 <- function() noBDT <- ADT[-ADT[BDT, which = T]]
f3 <- function() noBDT2 <- ADT[-BDT[, x]]
f1 <- function() noB <- A[-as.integer(rownames(B)), ]
library(rbenchmark)
benchmark(base = f1(),DT = f2(), DT2 = f3(), replications = 3)
## test replications elapsed relative user.self sys.self
## 2 DT 3 0.92 1.108 0.77 0.15
## 1 base 3 3.72 4.482 3.19 0.52
## 3 DT2 3 0.83 1.000 0.72 0.11
get the rows in A that aren't in B
C = A[! data.frame(t(A)) %in% data.frame(t(B)), ]
This is not the fastest and is likely to be very slow but is an alternative to mplourde's that takes into account the row data and should work on mixed data which flodel critiqued. It relies on the paste2 function from the qdap package which doesn't exist yet as I plan to release it within the enxt month or 2:
Paste 2 function:
paste2 <- function(multi.columns, sep=".", handle.na=TRUE, trim=TRUE){
if (trim) multi.columns <- lapply(multi.columns, function(x) {
gsub("^\\s+|\\s+$", "", x)
}
)
if (!is.data.frame(multi.columns) & is.list(multi.columns)) {
multi.columns <- do.call('cbind', multi.columns)
}
m <- if(handle.na){
apply(multi.columns, 1, function(x){if(any(is.na(x))){
NA
} else {
paste(x, collapse = sep)
}
}
)
} else {
apply(multi.columns, 1, paste, collapse = sep)
}
names(m) <- NULL
return(m)
}
# Flodel's mixed data set:
A <- data.frame(x = 1:4, y = as.character(1:4)); B <- A[1:2, ]
# My approach:
A[!paste2(A)%in%paste2(B), ]
A <- data.frame(x = 1:10, y = 1:10)
#Random subset of A in B
B <- A[sample(nrow(A),3),]
#get A that is not in B
C <- A[-as.integer(rownames(B)),]
Performance test vis-a-vis mplourde's answer:
library(rbenchmark)
f1 <- function() A[- as.integer(rownames(B)),]
f2 <- function() A[! data.frame(t(A)) %in% data.frame(t(B)), ]
benchmark(f1(), f2(), replications = 10000,
columns = c("test", "elapsed", "relative"),
order = "elapsed"
)
test elapsed relative
1 f1() 1.531 1.0000
2 f2() 8.846 5.7779
Looking at the rownames is approximately 6x faster. Two calls to transpose can get expensive computationally.
If this B data set is truly a nested version of the first data set there has to be indexing that created this data set to begin with. IMHO we shouldn't be discussing the differences between the data sets but negating the original indexing that created the B data set to begin with. Here's an example of what I mean:
A <- mtcars
B <- mtcars[mtcars$cyl==6, ]
C <- mtcars[mtcars$cyl!=6, ]