Combine (rbind) data frames and create column with name of original data frames


I have several data frames that I want to combine by row. In the resulting single data frame, I want to create a new variable identifying which data set each observation came from.

6 answers

爱一瞬间的悲伤

2020-11-22 13:45
It's not exactly what you asked for, but it's pretty close. Put your objects in a named list and use do.call(rbind...)
```
> do.call(rbind, list(df1 = df1, df2 = df2))
      x y
df1.1 1 2
df1.2 3 4
df2.1 5 6
df2.2 7 8
```
Notice that the row names now reflect the source data.frames.
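If a real column is wanted instead of row names, the prefix can be pulled back out of the row names. This is only a minimal sketch: it assumes the input data frames carry default (automatic) row names, and the names `combined` and `source` are illustrative choices, not part of the original answer.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

combined <- do.call(rbind, list(df1 = df1, df2 = df2))

# Row names look like "df1.1", "df1.2", ...; strip the trailing ".<row number>"
combined$source <- sub("\\.[0-9]+$", "", rownames(combined))

# Reset to plain 1..n row names now that the source lives in its own column
rownames(combined) <- NULL

combined
```

With custom row names on the inputs the pattern would need adjusting, so the approaches in the updates below are usually more robust.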

Update: Use cbind and rbind

Another option is to make a basic function like the following:
```
AppendMe <- function(dfNames) {
  do.call(rbind, lapply(dfNames, function(x) {
    cbind(get(x), source = x)
  }))
}
```
This function then takes a character vector of the data.frame names that you want to "stack", as follows:
```
> AppendMe(c("df1", "df2"))
  x y source
1 1 2    df1
2 3 4    df1
3 5 6    df2
4 7 8    df2
```
Update 2: Use combine from the "gdata" package
```
> library(gdata)
> combine(df1, df2)
  x y source
1 1 2    df1
2 3 4    df1
3 5 6    df2
4 7 8    df2
```
Update 3: Use rbindlist from "data.table"

A newer option is rbindlist from "data.table" together with its idcol argument. With that, the approach could be:
```
> library(data.table)
> rbindlist(mget(ls(pattern = "df\\d+")), idcol = TRUE)
   .id x y
1: df1 1 2
2: df1 3 4
3: df2 5 6
4: df2 7 8
```
Update 4: use map_df from "purrr"

Similar to rbindlist, you can also use map_df from "purrr" with I or c as the function to apply to each list element.
```
> library(purrr)
> mget(ls(pattern = "df\\d+")) %>% map_df(I, .id = "src")
Source: local data frame [4 x 3]

    src     x     y
  (chr) (int) (int)
1   df1     1     2
2   df1     3     4
3   df2     5     6
4   df2     7     8
```

北恋

2020-11-22 13:45

Another workaround for this one is using ldply in the plyr package...

```
library(plyr)

df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
dfs <- list(df1 = df1, df2 = df2)  # named list; the names become the .id column
df3 <- ldply(dfs)

df3
  .id x y
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
```


没有蜡笔的小新

2020-11-22 13:47

A blend of the other two answers:

```
df1 <- data.frame(x = 1:3,y = 1:3)
df2 <- data.frame(x = 4:6,y = 4:6)

foo <- function(...){
    args <- list(...)
    result <- do.call(rbind,args)
    # match.call()[-1] holds the unevaluated argument expressions (df1, df2, ...)
    result$source <- rep(as.character(match.call()[-1]),times = sapply(args,nrow))
    result
}

> foo(df1,df2,df1)
  x y source
1 1 1    df1
2 2 2    df1
3 3 3    df1
4 4 4    df2
5 5 5    df2
6 6 6    df2
7 1 1    df1
8 2 2    df1
9 3 3    df1
```

If you want to avoid the match.call business, you can always limit yourself to naming the function arguments (i.e. df1 = df1, df2 = df2) and using names(args) to access the names.
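A minimal sketch of that named-argument variant might look like the following. The function name `foo_named` and its input check are hypothetical additions for illustration; only the idea of reading `names(args)` comes from the answer.

```r
df1 <- data.frame(x = 1:3, y = 1:3)
df2 <- data.frame(x = 4:6, y = 4:6)

# Hypothetical variant of foo(): every argument must be passed by name
foo_named <- function(...) {
  args <- list(...)
  # Fail early if any argument was passed without a name
  stopifnot(!is.null(names(args)), all(nzchar(names(args))))
  # unname() keeps rbind from prefixing the row names with the list names
  result <- do.call(rbind, unname(args))
  # Repeat each argument name once per row of its data frame
  result$source <- rep(names(args), times = vapply(args, nrow, integer(1)))
  result
}

foo_named(df1 = df1, df2 = df2, again = df1)
```

Because the labels come from the argument names rather than from the call expression, the same data frame can be passed twice under different labels, as with `again = df1` here.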

0 讨论(0)

清酒与你

2020-11-22 13:50
Even though there are already some great answers here, I just wanted to add the one I have been using. It is base R, so it might be less limiting if you want to use it in a package, and it is a little faster than some of the other base R solutions.
```
dfs <- list(df1 = data.frame("x"=c(1,2), "y"=2),
            df2 = data.frame("x"=c(2,4), "y"=4),
            df3 = data.frame("x"=2, "y"=c(4,5,7)))

> library(microbenchmark)
> microbenchmark(cbind(do.call(rbind,dfs),
                       rep(names(dfs), vapply(dfs, nrow, numeric(1)))), times = 1001)
Unit: microseconds
     min      lq     mean  median      uq      max neval
 393.541 409.083 454.9913 433.422 453.657 6157.649  1001
```
The first part, do.call(rbind, dfs), binds the rows of the data frames into a single data frame. vapply(dfs, nrow, numeric(1)) counts the rows of each data frame, and those counts are passed to rep in rep(names(dfs), vapply(dfs, nrow, numeric(1))) so that each data frame's name is repeated once per row. cbind then attaches that vector as a new column.
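In the benchmarked expression the new column is unnamed, so cbind labels it with the deparsed rep(...) call. One way to package the same steps with an explicit column name is sketched below. The helper name `bind_with_source` and its `col` argument are assumptions for illustration, not part of the original answer.

```r
dfs <- list(df1 = data.frame(x = c(1, 2), y = 2),
            df2 = data.frame(x = c(2, 4), y = 4),
            df3 = data.frame(x = 2, y = c(4, 5, 7)))

# Hypothetical helper: one rbind, one cbind, then name the added column
bind_with_source <- function(dfs, col = "source") {
  out <- cbind(do.call(rbind, dfs),
               rep(names(dfs), vapply(dfs, nrow, numeric(1))))
  names(out)[ncol(out)] <- col   # rename the last (newly added) column
  rownames(out) <- NULL          # drop the "df1.1"-style row names
  out
}

bind_with_source(dfs)
```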

This is similar to a previously posted solution, but about 2x faster.
```
> microbenchmark(do.call(rbind, 
                         lapply(names(dfs), function(x) cbind(dfs[[x]], source = x))), 
                 times = 1001)
Unit: microseconds
      min      lq     mean  median       uq      max neval
  844.558 870.071 1034.182 896.464 1210.533 8867.858  1001
```
I am not 100% certain, but I believe the speed up is due to making a single call to cbind rather than one per data frame.

情书的邮戳

2020-11-22 13:55

Another approach using dplyr:

```
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))

# .id names the new column; its values are taken from the list names
df3 <- dplyr::bind_rows(list(df1=df1, df2=df2), .id = 'source')

df3
Source: local data frame [4 x 3]

  source     x     y
   (chr) (dbl) (dbl)
1    df1     1     2
2    df1     3     4
3    df2     5     6
4    df2     7     8
```


青春惊慌失措

2020-11-22 14:04

I'm not sure if such a function already exists, but this seems to do the trick:

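```
bindAndSource <- function(df1, df2) {
  # match.call() records the call as written; elements 2 and 3 are the argument expressions
  df1$source <- as.character(match.call())[[2]]
  df2$source <- as.character(match.call())[[3]]
  rbind(df1, df2)
}
```
Results:
```
> bindAndSource(df1, df2)
  x y source
1 1 2    df1
2 3 4    df1
3 5 6    df2
4 7 8    df2
```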

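Caveat: This will not work in *apply-like calls, because match.call() then records the expression used inside the loop rather than the name of the original data frame.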
