I have several data frames that I want to combine by row. In the resulting single data frame, I want to create a new variable identifying which data set the observation came from.
Even though there are already some great answers here, I wanted to add the one I have been using. It is base R, so it might be less limiting if you want to use it in a package, and it is a little faster than some of the other base R solutions.
dfs <- list(df1 = data.frame("x"=c(1,2), "y"=2),
            df2 = data.frame("x"=c(2,4), "y"=4),
            df3 = data.frame("x"=2, "y"=c(4,5,7)))
> microbenchmark(cbind(do.call(rbind,dfs),
rep(names(dfs), vapply(dfs, nrow, numeric(1)))), times = 1001)
Unit: microseconds
min lq mean median uq max neval
393.541 409.083 454.9913 433.422 453.657 6157.649 1001
The first part, do.call(rbind, dfs), binds the rows of the data frames into a single data frame. vapply(dfs, nrow, numeric(1)) finds how many rows each data frame has, and that count is passed to rep in rep(names(dfs), vapply(dfs, nrow, numeric(1))) to repeat the name of each data frame once for each of its rows. cbind then puts them all together.
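The steps described above can be written out one at a time. This is only a sketch using the same example list; the intermediate names (combined, n_rows, labels, result) and the column name source are my own choices for illustration and are not part of the benchmarked one-liner, which leaves the new column unnamed.

```r
dfs <- list(df1 = data.frame(x = c(1, 2), y = 2),
            df2 = data.frame(x = c(2, 4), y = 4),
            df3 = data.frame(x = 2, y = c(4, 5, 7)))

# Step 1: stack the rows of every data frame into one data frame.
combined <- do.call(rbind, dfs)

# Step 2: count how many rows each data frame contributes.
n_rows <- vapply(dfs, nrow, numeric(1))

# Step 3: repeat each list name once per row of its data frame.
labels <- rep(names(dfs), n_rows)

# Step 4: attach the labels with a single cbind call.
# Naming the argument gives the new column a readable name.
result <- cbind(combined, source = labels)
```

Passing the labels as a named argument (source = labels) avoids the long auto-generated column name you get when the rep(...) expression is passed to cbind directly.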
This is similar to a previously posted solution, but about twice as fast (median of roughly 433 versus 896 microseconds in the benchmarks here).
> microbenchmark(do.call(rbind,
lapply(names(dfs), function(x) cbind(dfs[[x]], source = x))),
times = 1001)
Unit: microseconds
min lq mean median uq max neval
844.558 870.071 1034.182 896.464 1210.533 8867.858 1001
I am not 100% certain, but I believe the speedup is due to making a single call to cbind rather than one per data frame.
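If you want to check that the two approaches give the same data before relying on the faster one, something like the following should work. It is a sketch: the names single_cbind and per_df_cbind are made up for this example, and I have added source = as the column name in the first version so that the columns line up.

```r
dfs <- list(df1 = data.frame(x = c(1, 2), y = 2),
            df2 = data.frame(x = c(2, 4), y = 4),
            df3 = data.frame(x = 2, y = c(4, 5, 7)))

# One cbind call after binding all rows.
single_cbind <- cbind(do.call(rbind, dfs),
                      source = rep(names(dfs), vapply(dfs, nrow, numeric(1))))

# One cbind call per data frame, then bind the rows.
per_df_cbind <- do.call(rbind,
                        lapply(names(dfs), function(x) cbind(dfs[[x]], source = x)))

# The two results carry different row names, so reset them before comparing.
rownames(single_cbind) <- NULL
rownames(per_df_cbind) <- NULL

# all.equal() returns TRUE when the two data frames match.
all.equal(single_cbind, per_df_cbind)
```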