How to select the row with the maximum value in each group

后端未结

关注

 16  1985

In a dataset with multiple observations for each subject I want to take a subset with only the maximum data value for each record. For example, with a following dataset:

相关标签:

16条回答

谎友^

2020-11-21 04:53

One more base R solution:

merge(aggregate(pt ~ Subject, max, data = group), group)

  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

0 讨论(0)

旧时难觅i

2020-11-21 04:54

Another option is slice

library(dplyr)
group %>%
     group_by(Subject) %>%
     slice(which.max(pt))
#    Subject    pt Event
#    <dbl> <dbl> <dbl>
#1       1     5     2
#2       2    17     2
#3       3     5     2

0 讨论(0)

遇见更好的自我

2020-11-21 04:57

The most intuitive method is to use group_by and top_n function in dplyr

    group %>% group_by(Subject) %>% top_n(1, pt)

The result you get is

    Source: local data frame [3 x 3]
    Groups: Subject [3]

      Subject    pt Event
        (dbl) (dbl) (dbl)
    1       1     5     2
    2       2    17     2
    3       3     5     2

0 讨论(0)

一整个雨季

2020-11-21 04:57

do.call(rbind, lapply(split(group,as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))

Using Base R

0 讨论(0)

耶瑟儿～

2020-11-21 05:00
by is a version of tapply for data frames:
```
res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])
```
It returns an object of class by so we convert it to data frame:
```
do.call(rbind, b)
  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

野趣味

2020-11-21 05:00

Another data.table option:

library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]

Or another (less readable but slightly faster):

group[group[, rn := .I][order(Subject, -pt), {
    rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]

timing code:

library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)


microbenchmark::microbenchmark(times=3L,
    mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
    mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
    mtd2 = {a2 <- DT2[DT2[, rn := .I][
        order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
    ]]},
    mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE

timings:

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493     3
 mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022     3
 mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814     3
 mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851     3

0 讨论(0)