Extract row corresponding to minimum value of a variable by group

前端 未结 6 1818
孤独总比滥情好
孤独总比滥情好 2020-11-22 04:04

I wish to (1) group data by one variable (State), (2) within each group find the row of minimum value of another variable (Employees), and (3) extr

相关标签:
6条回答
  • 2020-11-22 04:41

    Corrected plyr solution:

    ddply(df, .(State), function(x) x[which.min(x$Employees),])
    #   State Company Employees
    # 1    AK       D        24
    # 2    RI       E        19
    

    thanks to @joel.wilson

    0 讨论(0)
  • 2020-11-22 04:42

    Here a dplyr solution ( Note that I am not a regular user ):

    library(dplyr)    
    data %>% 
        group_by(State) %>% 
        slice(which.min(Employees))
    
    0 讨论(0)
  • 2020-11-22 04:45

    In base you can use ave to get min per group and compare this with Employees and get a logical vector to subset the data.frame.

    data[data$Employees == ave(data$Employees, data$State, FUN=min),]
    #  State Company Employees
    #4    AK       D        24
    #5    RI       E        19
    

    Or compare it already in the function.

    data[as.logical(ave(data$Employees, data$State, FUN=function(x) x==min(x))),]
    #data[ave(data$Employees, data$State, FUN=function(x) x==min(x))==1,] #Variant
    #  State Company Employees
    #4    AK       D        24
    #5    RI       E        19
    
    0 讨论(0)
  • 2020-11-22 04:50

    The base function by is often useful for working with block data in data.frames. For example

    by(data, data$State, function(x) x[which.min(x$Employees), ] )
    

    It does return the data in a list, but you can collapse that with

    do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))
    
    0 讨论(0)
  • 2020-11-22 05:01

    Slightly more elegant:

    library(data.table)
    DT[ , .SD[which.min(Employees)], by = State]
    
       State Company Employees
    1:    AK       D        24
    2:    RI       E        19
    

    Slighly less elegant than using .SD, but a bit faster (for data with many groups):

    DT[DT[ , .I[which.min(Employees)], by = State]$V1]
    

    Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.

    See also Subset rows corresponding to max value by group using data.table.

    0 讨论(0)
  • 2020-11-22 05:01

    As this is Google's top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then just take the uniques per State

    Either using data.table

    library(data.table)
    unique(setDT(data)[order(Employees)], by = "State")
    #    State Company Employees
    # 1:    RI       E        19
    # 2:    AK       D        24
    

    Alternatively, we could also first order and then subset .SD. Both of those operations were optimized in the resent data.table versions and order is seemingly triggers data.table:::forderv, while .SD[1L] triggers Gforce

    setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
    # order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
    # i clause present and columns used in by detected, only these subset: State 
    # Finding groups using forderv ... 0 sec
    # Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
    # Getting back original order ... 0 sec
    # lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
    # GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
    # Making each group and running j (GForce TRUE) ... 0 secs
    #    State Company Employees
    # 1:    RI       E        19
    # 2:    AK       D        24
    

    Or dplyr

    library(dplyr)
    data %>% 
      arrange(Employees) %>% 
      distinct(State, .keep_all = TRUE)
    #   State Company Employees
    # 1    RI       E        19
    # 2    AK       D        24
    

    Another interesting idea borrowed from @Khashaas awesome answer (with a small modification in form of mult = "first" in order to handle multiple matches) is to first find minimum per group and then perform a binary join back. The advantage of this is both the utilization of data.tables gmin function (which skips the evaluation overhead) and the binary join feature

    tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
    data[tmp, on = .(State, Employees), mult = "first"]
    #    State Company Employees
    # 1:    AK       D        24
    # 2:    RI       E        19
    

    Some benchmarks

    library(data.table)
    library(dplyr)
    library(plyr)
    library(stringi)
    library(microbenchmark)
    
    set.seed(123)
    N <- 1e6
    data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                       Employees = sample(N*10, N, replace = TRUE))
    DT <- copy(data)
    setDT(DT)
    DT2 <- copy(DT)
    str(DT)
    str(DT2)
    
    microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
                   "(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
                   "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
                   "(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
                   "(data.table) self join (on):" = {
                     tmp <- DT[, .(Employees = min(Employees)), by = State]
                     DT[tmp, on = .(State, Employees), mult = "first"]},
                   "(data.table) self join (setkey):" = {
                     tmp <- DT2[, .(Employees = min(Employees)), by = State] 
                     setkey(tmp, State, Employees)
                     setkey(DT2, State, Employees)
                     DT2[tmp, mult = "first"]},
                   "(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
                   "(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
                   "(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
                   "(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
                   "(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))
    
    
    # Unit: milliseconds
    #                             expr        min         lq       mean     median         uq       max neval      cld
    #    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d    
    #     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a       
    #      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b      
    #     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b      
    #     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b      
    # (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a       
    #       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c     
    #       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e   
    # (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f  
    #         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g 
    #                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h
    
    0 讨论(0)
提交回复
热议问题