On the web, I found that rbind() is used to combine two data frames, and that the same task is performed by bind_rows()
Since none of the answers here offers a systematic review of the differences between base::rbind and dplyr::bind_rows, and the answer from kss regarding performance is incorrect, I decided to add the following.
Let's create a test data frame:
df_1 = data.frame(
  v1_dbl = 1:1000,
  v2_lst = I(as.list(1:1000)),
  v3_fct = factor(sample(letters[1:10], 1000, replace = TRUE)),
  v4_raw = raw(1000),
  v5_dtm = as.POSIXct(paste0("2019-12-0", sample(1:9, 1000, replace = TRUE)))
)
df_1$v2_lst = unclass(df_1$v2_lst) # remove the AsIs class introduced by `I()`
base::rbind handles list inputs differently
rbind(list(df_1, df_1))
[,1] [,2]
[1,] List,5 List,5
# You have to combine it with `do.call()` to achieve the same result:
head(do.call(rbind, list(df_1, df_1)), 3)
v1_dbl v2_lst v3_fct v4_raw v5_dtm
1 1 1 b 00 2019-12-02
2 2 2 h 00 2019-12-08
3 3 3 c 00 2019-12-09
head(dplyr::bind_rows(list(df_1, df_1)), 3)
v1_dbl v2_lst v3_fct v4_raw v5_dtm
1 1 1 b 00 2019-12-02
2 2 2 h 00 2019-12-08
3 3 3 c 00 2019-12-09
base::rbind can cope with (some) mixed types
While both base::rbind and dplyr::bind_rows fail when trying to bind, e.g., a raw or datetime column to a column of some other type, base::rbind can cope with some degree of discrepancy.
Combining a list and a non-list column produces a list column. Combining a factor and something else produces a warning but not an error:
df_2 = data.frame(
  v1_dbl = 1,
  v2_lst = 1,
  v3_fct = 1,
  v4_raw = raw(1),
  v5_dtm = as.POSIXct("2019-12-01")
)
head(rbind(df_1, df_2), 3)
v1_dbl v2_lst v3_fct v4_raw v5_dtm
1 1 1 b 00 2019-12-02
2 2 2 h 00 2019-12-08
3 3 3 c 00 2019-12-09
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 1) : invalid factor level, NA generated
# Fails on the lst, num combination:
head(dplyr::bind_rows(df_1, df_2), 3)
Error: Column `v2_lst` can't be converted from list to numeric
# Fails on the fct, num combination:
head(dplyr::bind_rows(df_1[-2], df_2), 3)
Error: Column `v3_fct` can't be converted from factor to numeric
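One way around the errors above is to harmonize the column types yourself before binding. The following is only a minimal sketch: the data frames `x` and `y` are made up for illustration, and converting both columns to character is an assumption about what you want the result to look like.

```r
# Hypothetical example data: `grp` is a factor in x but numeric in y
x <- data.frame(id = 1:2, grp = factor(c("a", "b")))
y <- data.frame(id = 3L, grp = 1)

# Convert both columns to a common type (character) before binding
x$grp <- as.character(x$grp)
y$grp <- as.character(y$grp)

combined <- dplyr::bind_rows(x, y)
combined$grp
```

Whether character is the right common type depends on your data; the point is only that dplyr::bind_rows expects you to resolve such mismatches explicitly.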
base::rbind keeps rownames
Tidyverse advocates making rownames into a dedicated column, so its functions drop them.
rbind(mtcars[1:2, 1:4], mtcars[3:4, 1:4])
mpg cyl disp hp
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
Hornet 4 Drive 21.4 6 258 110
dplyr::bind_rows(mtcars[1:2, 1:4], mtcars[3:4, 1:4])
mpg cyl disp hp
1 21.0 6 160 110
2 21.0 6 160 110
3 22.8 4 108 93
4 21.4 6 258 110
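If the rownames carry information you want to keep, you can move them into a regular column before binding. This is a small sketch using tibble::rownames_to_column(); the column name "model" is my own choice for illustration.

```r
# Move the rownames into a regular column called "model" (name chosen here)
first  <- tibble::rownames_to_column(mtcars[1:2, 1:4], var = "model")
second <- tibble::rownames_to_column(mtcars[3:4, 1:4], var = "model")

# The car names now survive the bind as an ordinary column
combined <- dplyr::bind_rows(first, second)
combined$model
```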
base::rbind cannot cope with missing columns
Just for completeness, since Abhilash Kandwal already said so in their answer.
base::rbind handles named arguments differently
While base::rbind prepends argument names to rownames, dplyr::bind_rows has the option to add a dedicated ID column:
rbind(hi = mtcars[1:2, 1:4], bye = mtcars[3:4, 1:4])
mpg cyl disp hp
hi.Mazda RX4 21.0 6 160 110
hi.Mazda RX4 Wag 21.0 6 160 110
bye.Datsun 710 22.8 4 108 93
bye.Hornet 4 Drive 21.4 6 258 110
dplyr::bind_rows(hi = mtcars[1:2, 1:4], bye = mtcars[3:4, 1:4], .id = "my_id")
my_id mpg cyl disp hp
1 hi 21.0 6 160 110
2 hi 21.0 6 160 110
3 bye 22.8 4 108 93
4 bye 21.4 6 258 110
base::rbind makes vector arguments into rows (and recycles them)
In contrast, dplyr::bind_rows adds columns (and therefore requires the elements of x to be named):
rbind(mtcars[1:2, 1:4], x = 1:2)
mpg cyl disp hp
Mazda RX4 21 6 160 110
Mazda RX4 Wag 21 6 160 110
x 1 2 1 2
dplyr::bind_rows(mtcars[1:2, 1:4], x = c(a = 1, b = 2))
mpg cyl disp hp a b
1 21 6 160 110 NA NA
2 21 6 160 110 NA NA
3 NA NA NA NA 1 2
base::rbind is slower and requires more RAM
To bind a hundred medium-sized data frames (1k rows), base::rbind allocates more than 40 times as much RAM and is roughly 13 times slower in the benchmark below:
dfs = rep(list(df_1), 100)
bench::mark(
  "base::rbind" = do.call(rbind, dfs),
  "dplyr::bind_rows" = dplyr::bind_rows(dfs)
)[, 1:5]
# A tibble: 2 x 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 base::rbind 47.23ms 48.05ms 20.0 104.48MB
2 dplyr::bind_rows 3.69ms 3.75ms 261. 2.39MB
Since I needed to bind lots of small data frames, here is a benchmark for that too. The difference in speed, and especially in RAM, is quite striking:
dfs = rep(list(df_1[1:2, ]), 10^4)
bench::mark(
  "base::rbind" = do.call(rbind, dfs),
  "dplyr::bind_rows" = dplyr::bind_rows(dfs)
)[, 1:5]
# A tibble: 2 x 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 base::rbind 1.65s 1.65s 0.605 1.56GB
2 dplyr::bind_rows 19.31ms 20.21ms 43.7 566.69KB
Finally, help("rbind") and help("bind_rows") are interesting to read, too.
Although bind_rows() is more functional in the sense that it will combine data frames with different numbers of columns (assigning NA to rows with those columns missing), if you are combining data frames with the same columns, I would recommend rbind().
rbind() is much more computationally efficient in cases where the data you are combining are formatted the same way, and it simply throws an error when the number of columns is different. It will save you a lot of time for big data sets. I would highly recommend rbind() for these situations. Nonetheless, if your data has different columns, then you have to use bind_rows().
Apart from a few other differences, one of the main reasons for using bind_rows over rbind is to combine two data frames that have different numbers of columns. rbind throws an error in such a case, whereas bind_rows fills the columns that are missing from one of the data frames with NA in the rows coming from that data frame.
Try out the following code to see the difference:
a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)
Results for the two calls are as follows:
rbind(a, b)
> rbind(a, b)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
library(dplyr)
bind_rows(a, b)
> bind_rows(a, b)
a b c d
1 1 3 5 NA
2 2 4 6 NA
3 7 2 3 8
4 8 3 4 9
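If you want to stay in base R, you can pad the missing columns with NA yourself before calling rbind. The helper below is only a sketch: `rbind_fill` is a hypothetical name of my own, not a function from base R or dplyr, and it makes no attempt to reconcile differing column types.

```r
# Hypothetical helper: pad every data frame with NA columns so that all of
# them share the same set of columns, then bind them with base rbind
rbind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA  # add the missing column, filled with NA
    }
    df[all_cols]       # same column order in every data frame
  })
  do.call(rbind, padded)
}

a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)
combined <- rbind_fill(a, b)
combined
```

For the two data frames above, this should give the same four rows as bind_rows(a, b), with d set to NA for the rows that come from a.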