Difference between rbind() and bind_rows() in R

盖世英雄少女心 2020-12-08 14:36

On the web, I found that rbind() is used to combine two data frames, and that bind_rows() performs the same task. What is the difference between the two?

3 answers
  • 2020-12-08 14:46

    Since none of the answers here offers a systematic review of the differences between base::rbind and dplyr::bind_rows, and the answer from kss regarding performance is incorrect, I decided to add the following.

    Let's create a test data frame:

    df_1 = data.frame(
      v1_dbl = 1:1000,
      v2_lst = I(as.list(1:1000)),
      v3_fct = factor(sample(letters[1:10], 1000, replace = TRUE)),
      v4_raw = raw(1000),
      v5_dtm = as.POSIXct(paste0("2019-12-0", sample(1:9, 1000, replace = TRUE)))
    )
    
    df_1$v2_lst = unclass(df_1$v2_lst) #remove the AsIs class introduced by `I()`
    

    1. base::rbind handles list inputs differently

    rbind(list(df_1, df_1))
         [,1]   [,2]  
    [1,] List,5 List,5
    
    # You have to combine it with `do.call()` to achieve the same result:
    head(do.call(rbind, list(df_1, df_1)), 3)
      v1_dbl v2_lst v3_fct v4_raw     v5_dtm
    1      1      1      b     00 2019-12-02
    2      2      2      h     00 2019-12-08
    3      3      3      c     00 2019-12-09
    
    head(dplyr::bind_rows(list(df_1, df_1)), 3)
      v1_dbl v2_lst v3_fct v4_raw     v5_dtm
    1      1      1      b     00 2019-12-02
    2      2      2      h     00 2019-12-08
    3      3      3      c     00 2019-12-09
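
    As a smaller illustration of the same point, here is a minimal sketch using a made-up list `dfs` of three tiny data frames (the data is hypothetical and not part of the original answer). It only shows that `do.call(rbind, ...)` and `dplyr::bind_rows()` arrive at the same rows for well-behaved inputs:

    ```r
    # Hypothetical example data: three small data frames stored in a list
    dfs <- list(
      data.frame(id = 1:2, value = c("a", "b")),
      data.frame(id = 3:4, value = c("c", "d")),
      data.frame(id = 5:6, value = c("e", "f"))
    )

    # rbind() treats the list itself as a single object, so do.call() is
    # needed to pass each list element as a separate argument
    base_result <- do.call(rbind, dfs)

    # bind_rows() accepts the list directly
    dplyr_result <- dplyr::bind_rows(dfs)

    nrow(base_result)
    nrow(dplyr_result)
    ```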
    

    2. base::rbind can cope with (some) mixed types

    While both base::rbind and dplyr::bind_rows fail when trying to bind, e.g., a raw or datetime column to a column of some other type, base::rbind can cope with some degree of discrepancy.

    Combining a list and a non-list column produces a list column. Combining a factor and something else produces a warning but not an error:

    df_2 = data.frame(
      v1_dbl = 1,
      v2_lst = 1,
      v3_fct = 1,
      v4_raw = raw(1),
      v5_dtm = as.POSIXct("2019-12-01")
    )
    
    head(rbind(df_1, df_2), 3)
      v1_dbl v2_lst v3_fct v4_raw     v5_dtm
    1      1      1      b     00 2019-12-02
    2      2      2      h     00 2019-12-08
    3      3      3      c     00 2019-12-09
    Warning message:
    In `[<-.factor`(`*tmp*`, ri, value = 1) : invalid factor level, NA generated
    
    # Fails on the lst, num combination:
    head(dplyr::bind_rows(df_1, df_2), 3)
    Error: Column `v2_lst` can't be converted from list to numeric
    
    # Fails on the fct, num combination:
    head(dplyr::bind_rows(df_1[-2], df_2), 3)
    Error: Column `v3_fct` can't be converted from factor to numeric
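
    If you want to use dplyr::bind_rows on columns whose types disagree, one possible workaround is to convert them to a common type first. The following is only a sketch with made-up data frames `df_a` and `df_b`; converting to character is an assumption about what you want, and another target type may suit your data better:

    ```r
    # Hypothetical example data: `grp` is a factor in one data frame
    # and numeric in the other
    df_a <- data.frame(x = 1:2, grp = factor(c("a", "b")))
    df_b <- data.frame(x = 3, grp = 1)

    # Convert both columns to character so that the types agree
    df_a$grp <- as.character(df_a$grp)
    df_b$grp <- as.character(df_b$grp)

    combined <- dplyr::bind_rows(df_a, df_b)
    combined$grp
    ```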
    

    3. base::rbind keeps rownames

    Tidyverse advocates making rownames into a dedicated column, so its functions drop them.

    rbind(mtcars[1:2, 1:4], mtcars[3:4, 1:4])
                    mpg cyl disp  hp
    Mazda RX4      21.0   6  160 110
    Mazda RX4 Wag  21.0   6  160 110
    Datsun 710     22.8   4  108  93
    Hornet 4 Drive 21.4   6  258 110
    
    dplyr::bind_rows(mtcars[1:2, 1:4], mtcars[3:4, 1:4])
       mpg cyl disp  hp
    1 21.0   6  160 110
    2 21.0   6  160 110
    3 22.8   4  108  93
    4 21.4   6  258 110
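
    If the rownames carry information you need, you can move them into a regular column before binding. A minimal sketch, assuming the tibble package is installed; the column name "car" is an arbitrary choice for this example:

    ```r
    # Move the row names into a regular column before binding, so that
    # bind_rows() has nothing to drop
    top <- tibble::rownames_to_column(mtcars[1:2, 1:4], var = "car")
    bottom <- tibble::rownames_to_column(mtcars[3:4, 1:4], var = "car")

    combined <- dplyr::bind_rows(top, bottom)
    combined$car
    ```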
    

    4. base::rbind cannot cope with missing columns

    Just for completeness, since Abhilash Kandwal already said so in their answer.
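
    If you have to stay in base R, one possible workaround is to add the missing columns yourself before calling rbind. This is only a sketch: the data frames and the helper `fill_missing()` are made up for illustration and are not part of base R:

    ```r
    # Hypothetical example data with partially overlapping columns
    a <- data.frame(a = 1:2, b = 3:4)
    b <- data.frame(a = 7:8, d = 8:9)

    # Hypothetical helper: add every column in `cols` that `df` lacks,
    # filled with NA, and return the columns in a consistent order
    fill_missing <- function(df, cols) {
      for (nm in setdiff(cols, names(df))) {
        df[[nm]] <- NA
      }
      df[cols]
    }

    all_cols <- union(names(a), names(b))
    combined <- rbind(fill_missing(a, all_cols), fill_missing(b, all_cols))
    combined
    ```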

    5. base::rbind handles named arguments differently

    While base::rbind prepends argument names to rownames, dplyr::bind_rows has the option to add a dedicated ID column:

    rbind(hi = mtcars[1:2, 1:4], bye = mtcars[3:4, 1:4])
                        mpg cyl disp  hp
    hi.Mazda RX4       21.0   6  160 110
    hi.Mazda RX4 Wag   21.0   6  160 110
    bye.Datsun 710     22.8   4  108  93
    bye.Hornet 4 Drive 21.4   6  258 110
    
    dplyr::bind_rows(hi = mtcars[1:2, 1:4], bye = mtcars[3:4, 1:4], .id = "my_id")
      my_id  mpg cyl disp  hp
    1    hi 21.0   6  160 110
    2    hi 21.0   6  160 110
    3   bye 22.8   4  108  93
    4   bye 21.4   6  258 110
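
    The `.id` argument also works when the data frames come in a named list, which is handy when each element stems from a different source. A short sketch with a made-up list `batches` (the names and values are hypothetical):

    ```r
    # Hypothetical named list, e.g. one data frame per input file
    batches <- list(
      jan = data.frame(sales = c(10, 20)),
      feb = data.frame(sales = c(30, 40, 50))
    )

    # The list names end up in the new "month" column
    combined <- dplyr::bind_rows(batches, .id = "month")
    combined$month
    ```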
    

    6. base::rbind makes vector arguments into rows (and recycles them)

    In contrast, dplyr::bind_rows adds columns (and therefore requires the elements of x to be named):

    rbind(mtcars[1:2, 1:4], x = 1:2)
                  mpg cyl disp  hp
    Mazda RX4      21   6  160 110
    Mazda RX4 Wag  21   6  160 110
    x               1   2    1   2
    
    dplyr::bind_rows(mtcars[1:2, 1:4], x = c(a = 1, b = 2))
      mpg cyl disp  hp  a  b
    1  21   6  160 110 NA NA
    2  21   6  160 110 NA NA
    3  NA  NA   NA  NA  1  2
    

    7. base::rbind is slower and requires more RAM

    To bind a hundred medium-sized data frames (1k rows each), base::rbind requires more than forty times as much RAM and is more than 12 times slower in the benchmark below:

    dfs = rep(list(df_1), 100)
    bench::mark(
      "base::rbind" = do.call(rbind, dfs),
      "dplyr::bind_rows" = dplyr::bind_rows(dfs)
    )[, 1:5]
    
    # A tibble: 2 x 5
      expression            min   median `itr/sec` mem_alloc
      <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
    1 base::rbind       47.23ms  48.05ms      20.0  104.48MB
    2 dplyr::bind_rows   3.69ms   3.75ms     261.     2.39MB
    

    Since I needed to bind lots of small data frames, here is a benchmark for that too. Both the speed difference (roughly 80 times) and especially the RAM difference (1.56GB versus 567KB) are quite striking:

    dfs = rep(list(df_1[1:2, ]), 10^4)
    bench::mark(
      "base::rbind" = do.call(rbind, dfs),
      "dplyr::bind_rows" = dplyr::bind_rows(dfs)
    )[, 1:5]
    
    # A tibble: 2 x 5
      expression            min   median `itr/sec` mem_alloc
      <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>
    1 base::rbind         1.65s    1.65s     0.605    1.56GB
    2 dplyr::bind_rows  19.31ms  20.21ms    43.7    566.69KB
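
    When many small data frames are produced one at a time, a common pattern is to collect them in a list and bind once at the end, rather than binding inside the loop. The sketch below is not part of the benchmark above; `make_piece()` and the number of pieces are made up for illustration:

    ```r
    # Hypothetical workload: 200 tiny data frames produced one at a time
    make_piece <- function(i) data.frame(id = i, value = i^2)

    # Collect the pieces in a preallocated list ...
    pieces <- vector("list", 200)
    for (i in seq_along(pieces)) {
      pieces[[i]] <- make_piece(i)
    }

    # ... and bind them in a single call at the end
    combined <- dplyr::bind_rows(pieces)
    dim(combined)
    ```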
    

    Finally, help("rbind") and help("bind_rows") are interesting to read, too.

  • 2020-12-08 14:59

    Although bind_rows() is more functional in the sense that it will combine data frames with different numbers of columns (assigning NA to rows with those columns missing), if you are combining data frames with the same columns, I would recommend rbind().

    rbind() is much more computationally efficient in cases where the data you are combining are formatted the same way, and it simply throws an error when the number of columns is different. It will save you a lot of time for big data sets. I would highly recommend rbind() for these situations. Nonetheless, if your data has different columns, then you have to use bind_rows().

  • 2020-12-08 15:02

    Apart from a few other differences, one of the main reasons for using bind_rows over rbind is to combine two data frames that have different numbers of columns. rbind throws an error in such a case, whereas bind_rows fills any column that is missing from one of the data frames with NA for the rows that come from that data frame.

    Try out the following code to see the difference:

    a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
    b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)
    

    Results for the two calls are as follows:

    > rbind(a, b)
    Error in rbind(deparse.level, ...) : 
      numbers of columns of arguments do not match
    
    > library(dplyr)
    > bind_rows(a, b)
      a b c  d
    1 1 3 5 NA
    2 2 4 6 NA
    3 7 2 3  8
    4 8 3 4  9
    