问题
What is efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating NA positions (for other purposes)
dt = data.table(id=rep(1:2,7), category=c("x","y",NA))
print(dt)
In this toy example, ignoring NA, x
is common category for id==1
and y
for id==2
.
回答1:
If you want to ignore NA
's, you have to exclude them first with !is.na(category)
, group by id
and category
(by = .(id, category)
) and create a frequency variable with .N
:
dt[!is.na(category), .N, by = .(id, category)]
which gives:
id category N
1: 1 x 3
2: 2 y 3
3: 2 x 2
4: 1 y 2
Ordering this by id
will give you a clearer picture:
dt[!is.na(category), .N, by = .(id, category)][order(id)]
which results in:
id category N
1: 1 x 3
2: 1 y 2
3: 2 y 3
4: 2 x 2
If you just want the rows which indicate the top results:
dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]
or:
dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]
which both give:
id category N
1: 1 x 3
2: 2 y 3
来源:https://stackoverflow.com/questions/34403017/concise-r-data-table-syntax-for-modal-value-most-frequent-by-group