Question
I am going through the documentation of data.table and have also noticed from some conversations here on SO that rbindlist is supposed to be better than rbind.
I would like to know why rbindlist is better than rbind, and in which scenarios rbindlist really excels over rbind.
Is there any advantage in terms of memory utilization?
Answer 1:
rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame.
Where does it really excel
Some questions that show where rbindlist shines are:
Fast vectorized merge of list of data.frames by row
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
rbind.data.frame is slow, for a reason
rbind.data.frame does lots of checking, and will match by name (i.e. it accounts for the fact that columns may be in different orders, and matches them up by name). rbindlist doesn't do this kind of checking, and will join by position.
For example:
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
Some other limitations of rbindlist
It used to struggle with factors, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names; see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbind.data.frame rownames can be frustrating
rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.
You can get into a muddle with rownames using do.call(rbind, list(...)); see
How to avoid renaming of rows when using rbind inside do.call?
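A minimal sketch of the row-name difference, assuming a named list of two tiny data.frames (the list names `first` and `second` and the column `x` are made up for illustration):

```r
library(data.table)

# Hypothetical input: a named list of two small data.frames.
dfs <- list(first = data.frame(x = 1:2), second = data.frame(x = 3:4))

# Base R builds row names from the list names plus a running index,
# e.g. "first.1", "first.2", "second.1", "second.2".
base_res <- do.call(rbind, dfs)
rownames(base_res)

# rbindlist returns a data.table, which only carries the default
# sequential row names.
dt_res <- rbindlist(dfs)
rownames(dt_res)
```

If the list names carry information you need, it is probably safer to store them in a real column than to rely on row names.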
Memory efficiency
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R. It does lots of assigning, and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
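As a small illustration of "by reference", the sketch below uses data.table's address() to check that setattr() leaves the object where it is in memory. The vector `x` and the attribute name `note` are arbitrary choices for the example. Whether the base `attr<-` form copies in a given situation depends on the R version and on how many references the object has, so that side is deliberately not shown.

```r
library(data.table)

x <- c(1, 2, 3)
before <- address(x)   # memory address of the vector

# setattr() sets the attribute in place; x itself is not copied.
setattr(x, "note", "set by reference")
after <- address(x)

identical(before, after)   # same object before and after
attr(x, "note")
```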
Answer 2:
By v1.9.2, rbindlist had evolved quite a bit, implementing many features including:
- Choosing the highest SEXPTYPE of columns while binding - implemented in v1.9.2, closing FR #2456 and Bug #4981.
- Handling factor columns properly - first implemented in v1.8.10, closing Bug #2650, and extended to binding ordered factors carefully in v1.9.2 as well, closing FR #4856 and Bug #5019.
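The type promotion described above can be sketched as follows, assuming a recent data.table version. The column name `v` and the values are invented for the example.

```r
library(data.table)

# Integer bound with double: the result column is double.
num <- rbindlist(list(data.table(v = 1L), data.table(v = 2.5)))
class(num$v)

# Integer bound with character: the result column is character.
chr <- rbindlist(list(data.table(v = 1L), data.table(v = "a")))
class(chr$v)

# Factor bound with character: the values survive the bind.
fct <- rbindlist(list(data.table(v = factor("x")), data.table(v = "y")))
as.character(fct$v)
```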
In addition, in v1.9.2, rbind.data.table also gained a fill argument, implemented in R, that allows binding by filling missing columns.
Now in v1.9.3, there are even more improvements on these existing features:
- rbindlist gains an argument use.names, which by default is FALSE for backwards compatibility.
- rbindlist also gains an argument fill, which by default is also FALSE for backwards compatibility.
- These features are all implemented in C, and written carefully so as not to compromise speed while adding functionality.
- Since rbindlist can now match by names and fill missing columns, rbind.data.table just calls rbindlist now. The only difference is that use.names=TRUE by default for rbind.data.table, for backwards compatibility.
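A short sketch of the two arguments, using three made-up data.tables whose columns differ in order and presence:

```r
library(data.table)

# Hypothetical inputs: same columns in a different order, plus one
# table that lacks b and adds a new column c.
l <- list(
  data.table(a = 1:2, b = 3:4),
  data.table(b = 5:6, a = 7:8),
  data.table(a = 9L, c = "x")
)

# Match columns by name and fill the gaps with NA.
res <- rbindlist(l, use.names = TRUE, fill = TRUE)
res

# rbind on data.tables matches by name by default.
rb <- rbind(l[[1]], l[[2]])
rb
```

Setting use.names explicitly keeps the result independent of whichever default the installed data.table version applies.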
rbind.data.frame slows down quite a bit, mostly due to copies (which @mnel points out as well) that could be avoided (by moving to C). I think that's not the only reason. The implementation for checking/matching column names in rbind.data.frame could also get slower when there are many columns per data.frame and there are many such data.frames to bind (as shown in the benchmark below).
However, the fact that rbindlist lacks (or lacked) certain features (like checking factor levels or matching names) carries very little (or no) weight in explaining why it is faster than rbind.data.frame. Those features were carefully implemented in C, optimised for speed and memory.
Here's a benchmark that highlights efficient binding while also matching by column names, using rbindlist's use.names feature from v1.9.3. The data set consists of 10000 data.frames, each of size 10*500.
NB: this benchmark has been updated to include a comparison to dplyr's bind_rows.
library(data.table) # 1.11.5, 2018-06-02 00:09:06 UTC
library(dplyr) # 0.7.5.9000, 2018-06-12 01:41:40 UTC
set.seed(1L)
names = paste0("V", 1:500)
cols = 500L
foo <- function() {
data = as.data.frame(setDT(lapply(1:cols, function(x) sample(10))))
setnames(data, sample(names))
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
ll[[i]] = foo() # fill the preallocated list
}
system.time(ans1 <- rbindlist(ll))
# user system elapsed
# 1.226 0.070 1.296
system.time(ans2 <- rbindlist(ll, use.names=TRUE))
# user system elapsed
# 2.635 0.129 2.772
system.time(ans3 <- do.call("rbind", ll))
# user system elapsed
# 36.932 1.628 38.594
system.time(ans4 <- bind_rows(ll))
# user system elapsed
# 48.754 0.384 49.224
identical(ans2, setDT(ans3))
# [1] TRUE
identical(ans2, setDT(ans4))
# [1] TRUE
Binding columns as such without checking for names took just 1.3 seconds, whereas checking for column names and binding appropriately took just 1.5 seconds more. Compared to the base solution, this is 14x faster, and 18x faster than dplyr's version.
Source: https://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind