I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier
Based on the above suggestions, the following data.table
solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT[]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD
should consider by mentioning them in .SDcols
:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD
:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
One option from dplyr 1.0.0
could be:
DF %>%
rowwise() %>%
mutate(row_max = names(.)[which.max(c_across(everything()))])
V1 V2 V3 row_max
<dbl> <dbl> <dbl> <chr>
1 2 7 9 V3
2 8 3 6 V1
3 1 5 4 V2
Sample data:
DF <- structure(list(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6,
4)), class = "data.frame", row.names = c(NA, -3L))
If you're interested in a data.table
solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table
answer:require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table
scales pretty well too.
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
Here is an answer that works with data.table and is simpler. This assumes your data.table is named yourDF
:
j1 <- max.col(yourDF[, .(V1, V2, V3, V4)], "first")
yourDF$newCol <- c("V1", "V2", "V3", "V4")[j1]
Replace ("V1", "V2", "V3", "V4")
and (V1, V2, V3, V4)
with your column names
dplyr
solution:Idea:
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n
columns.
Example for n=2
:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
One option using your data (for future reference, use set.seed()
to make examples using sample
reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply
might be max.col
:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method
can be any of "random"
"first"
or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2