`split` is a generic. Whereas `split.default` is quite fast, `split.data.frame` gets terribly slow as the number of levels to split on increases. A faster alternative is to use `data.table`. I'll illustrate the difference on bigger data here:
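As a quick way to see what the generic dispatches to (a minimal check; the exact list depends on your R version and which packages are loaded):

```r
## List the S3 methods behind the generic; split.data.frame is the one
## that degrades as the number of levels grows.
methods(split)
# e.g. split.data.frame, split.Date, split.default, split.POSIXct, ...
```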
**Sample data** (what @Roland was referring to in his comment):
```r
require(data.table)
set.seed(45)
DF <- data.frame(ids = sample(1e4, 1e6, TRUE), x = sample(letters, 1e6, TRUE),
                 y = runif(1e6))
DT <- as.data.table(DF)
```
**Functions + benchmarking**
Note that the order of the data will differ here, as `split` sorts by "ids". If you want that order, you can first do `setkey(DT, ids)` and then run `f2` (see the sketch after the function definitions).
```r
f1 <- function() split(DF, DF$ids)
f2 <- function() {
    ans <- DT[, list(list(.SD)), by = ids]$V1
    setattr(ans, 'names', unique(DT$ids))  # sets names by reference, no copy here
}
```
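For completeness, a minimal sketch of that keyed variant (`f3` is a hypothetical name, not part of the benchmark below): `setkey` physically reorders `DT` by reference, so both the groups and `unique(DT$ids)` come out sorted, matching the order `split` produces.

```r
## Hypothetical f3: key DT by ids first (reorders by reference), so the
## grouped output and its names are in the same sorted order as split()'s.
f3 <- function() {
    setkey(DT, ids)
    ans <- DT[, list(list(.SD)), by = ids]$V1
    setattr(ans, 'names', unique(DT$ids))
    ans
}
```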
```r
require(microbenchmark)
microbenchmark(ans1 <- f1(), ans2 <- f2(), times = 10)
# Unit: milliseconds
#           expr        min         lq     median         uq       max neval
#   ans1 <- f1() 37015.9795 43994.6629 48132.3364 49086.0926 63829.592    10
#   ans2 <- f2()   332.6094   361.1902   409.2191   528.0674  1005.457    10
```
`split.data.frame` took a median of ~48 seconds, whereas `data.table` took ~0.41 seconds.
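To sanity-check that the two results hold the same groups, something along these lines should work (a sketch, with one caveat: `split` keeps the "ids" column and the original row names, while `.SD` drops the grouping column, so compare the values only):

```r
## Compare one group's values across the two results. ans1 elements are
## data.frames with an ids column and original row names; ans2 elements
## are data.tables without the grouping column.
g <- names(ans1)[1]
i <- which(as.character(names(ans2)) == g)   # robust if names were set as integers
all.equal(ans1[[g]][, c("x", "y")], as.data.frame(ans2[[i]]),
          check.attributes = FALSE)          # should be TRUE
```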