For each row return the column name of the largest value

礼貌的吻别 2020-11-21 07:06

I have a roster of employees, and I need to know which department each of them is in most often. It is trivial to tabulate employee ID against department name, but it is trickier

8 answers
  • 2020-11-21 07:35

    Based on the above suggestions, the following data.table solution worked very fast for me:

    library(data.table)
    
    set.seed(45)
    DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
    
    system.time(
      DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
    )
    #>    user  system elapsed 
    #>    0.15    0.06    0.21
    DT[]
    #>          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
    #>       1:  7  4  1  2  3  7  6  6  6   1      V1
    #>       2:  4  6  9 10  6  2  7  7  1   3      V4
    #>       3:  3  4  9  8  9  9  8  8  6   7      V3
    #>       4:  4  8  8  9  7  5  9  2  7   1      V4
    #>       5:  4  3  9 10  2  7  9  6  6   9      V4
    #>      ---                                       
    #>  999996:  4  6 10  5  4  7  3  8  2   8      V3
    #>  999997:  8  7  6  6  3 10  2  3 10   1      V6
    #>  999998:  2  3  2  7  4  7  5  2  7   3      V4
    #>  999999:  8 10  3  2  3  4  5  1  1   4      V2
    #> 1000000: 10  4  2  6  6  2  8  4  7   4      V1
    

    This also has the advantage that you can always specify which columns .SD should consider by listing them in .SDcols:

    DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
    

    In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:

    DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
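
The negation trick can be checked on a small example. The following is a minimal base R sketch, assuming a plain matrix instead of a data.table; the matrix `m` and the variable names are illustrative only. Since `max.col(-m)` finds the position of the largest negated value, it returns the column of the smallest original value.

```r
# Toy matrix with named columns (filled column by column).
m <- matrix(c(2, 8, 1,
              7, 3, 5,
              9, 6, 4),
            ncol = 3, dimnames = list(NULL, c("V1", "V2", "V3")))

# Column name of the largest value in each row.
col_max <- colnames(m)[max.col(m, ties.method = "first")]

# Column name of the smallest value in each row, via negation.
col_min <- colnames(m)[max.col(-m, ties.method = "first")]

# Cross-check against the slower row-wise which.min().
check_min <- colnames(m)[apply(m, 1, which.min)]
```

Negation only works for numeric columns, so this assumes no character or factor columns are included.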
    
  • 2020-11-21 07:35

    One option from dplyr 1.0.0 could be:

    library(dplyr)
    
    DF %>%
     rowwise() %>%
     mutate(row_max = names(.)[which.max(c_across(everything()))])
    
         V1    V2    V3 row_max
      <dbl> <dbl> <dbl> <chr>  
    1     2     7     9 V3     
    2     8     3     6 V1     
    3     1     5     4 V2     
    

    Sample data:

    DF <- structure(list(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 
    4)), class = "data.frame", row.names = c(NA, -3L))
    
  • 2020-11-21 07:42

    If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id of the first maximum; it's much easier if you'd rather have the last maximum. Nevertheless, it's not that complicated and it's fast!

    Here I've generated data of your dimensions (26746 * 18).

    Data

    set.seed(45)
    DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
    

    data.table answer:

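    require(data.table)
    # Long format: one row per cell. Note the naming: `colid` holds the row
    # index of DF, and `rowid` holds the column name.
    DT <- data.table(value=unlist(DF, use.names=FALSE), 
                colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
    # Sort by row index, then by value, so each row's maximum comes last.
    setkey(DT, colid, value)
    # Inner join: each row's maximum value. Outer join: first column holding it.
    t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]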
    

    Benchmarking:

    # data.table solution
    system.time({
    DT <- data.table(value=unlist(DF, use.names=FALSE), 
                colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
    setkey(DT, colid, value)
    t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
    })
    #   user  system elapsed 
    #  0.174   0.029   0.227 
    
    # apply solution from @thelatemail
    system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
    #   user  system elapsed 
    #  2.322   0.036   2.602 
    
    identical(t1, t2)
    # [1] TRUE
    

    It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.


    Edit: if any of the max ids is okay, then:

    DT <- data.table(value=unlist(DF, use.names=FALSE), 
                colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
    setkey(DT, colid, value)
    t1 <- DT[J(unique(colid)), rowid, mult="last"]
    
  • 2020-11-21 07:42

    Here is an answer that works with data.table and is simpler. This assumes your data.table is named yourDF:

    j1 <- max.col(yourDF[, .(V1, V2, V3, V4)], "first")
    yourDF$newCol <- c("V1", "V2", "V3", "V4")[j1]
    

    Replace ("V1", "V2", "V3", "V4") and (V1, V2, V3, V4) with your own column names.
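
To avoid typing the column names twice, one possibility is to keep them in a character vector. This is a minimal base R sketch, assuming a plain data.frame rather than a data.table; the `cols` variable and the sample data are illustrative only.

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))

# Columns to compare, listed once.
cols <- c("V1", "V2")

# max.col() converts its argument to a matrix, so a data.frame subset works.
DF$newCol <- cols[max.col(DF[cols], ties.method = "first")]
DF$newCol
```

With a data.table, the same idea is usually expressed through `.SDcols`, as in the first answer.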

  • 2020-11-21 07:45

    A dplyr solution:

    Idea:

    • add rowids as a column
    • reshape to long format
    • filter for max in each group

    Code:

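    library(dplyr)   # %>%, group_by, filter, mutate, arrange
    library(tibble)  # rownames_to_column
    library(tidyr)   # gather
    
    DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))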
    DF %>% 
      rownames_to_column() %>%
      gather(column, value, -rowname) %>%
      group_by(rowname) %>% 
      filter(rank(-value) == 1) 
    

    Result:

    # A tibble: 3 x 3
    # Groups:   rowname [3]
      rowname column value
      <chr>   <chr>  <dbl>
    1 2       V1         8
    2 3       V2         5
    3 1       V3         9
    

    This approach can be easily extended to get the top n columns. Example for n=2:

    DF %>% 
      rownames_to_column() %>%
      gather(column, value, -rowname) %>%
      group_by(rowname) %>% 
      mutate(rk = rank(-value)) %>%
      filter(rk <= 2) %>% 
      arrange(rowname, rk) 
    

    Result:

    # A tibble: 6 x 4
    # Groups:   rowname [3]
      rowname column value    rk
      <chr>   <chr>  <dbl> <dbl>
    1 1       V3         9     1
    2 1       V2         7     2
    3 2       V1         8     1
    4 2       V3         6     2
    5 3       V2         5     1
    6 3       V3         4     2
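
One caveat is worth checking if ties can occur in your data: `rank()` averages tied ranks by default, so the `rank(-value) == 1` filter above would drop a row whose maximum appears in more than one column. A minimal base R sketch of the behaviour, using a made-up vector `x`:

```r
# A row whose maximum (7) is shared by two columns.
x <- c(7, 7, 2)

# Default ties.method = "average": the tied maxima both get rank 1.5.
r_avg <- rank(-x)

# ties.method = "min": both tied maxima get rank 1 (all tied columns kept).
r_min <- rank(-x, ties.method = "min")

# ties.method = "first": only the first tied maximum gets rank 1.
r_first <- rank(-x, ties.method = "first")
```

Passing `ties.method = "min"` or `"first"` to `rank()` inside the pipeline is one way to choose between keeping all tied columns and keeping only the first.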
    
  • 2020-11-21 07:49

    One option using your data (for future reference, use set.seed() to make examples using sample reproducible):

    DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
    
    colnames(DF)[apply(DF,1,which.max)]
    #[1] "V3" "V1" "V2"
    

    A faster solution than using apply might be max.col:

    colnames(DF)[max.col(DF,ties.method="first")]
    #[1] "V3" "V1" "V2"
    

    ...where ties.method can be any of "random", "first", or "last".

    This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:

    DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
    apply(DF,1,function(x) which(x==max(x)))
    
    [[1]]
    V2 V3 
     2  3 
    
    [[2]]
    V1 
     1 
    
    [[3]]
    V2 
     2 
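
If you want one result per row even when there are ties, one option is to collapse the tied column names into a single string. This is a minimal sketch, assuming a comma-separated string is an acceptable output; the `tied` variable name is illustrative only.

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(7, 6, 4))

# For each row, keep the names of every column equal to the row maximum
# and join them with commas.
tied <- apply(DF, 1, function(x) paste(names(x)[x == max(x)], collapse = ","))
tied
```

Rows without ties yield a single column name, so the result is always a character vector with one element per row.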
    