Concise R data.table syntax for modal value (most frequent) by group

Submitted by 六月ゝ 毕业季﹏ on 2019-12-11 00:12:20

Question


What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating the NA positions (for other purposes).

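library(data.table)
# rep_len() spells out the recycling of the 3 categories over the 14 rows
dt = data.table(id=rep(1:2,7), category=rep_len(c("x","y",NA), 14))
print(dt)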

In this toy example, ignoring NA, x is the most common category for id==1 and y for id==2.


Answer 1:


If you want to ignore NA values, first exclude them with !is.na(category), then group by id and category (by = .(id, category)) and count the rows in each group with .N:

 dt[!is.na(category), .N, by = .(id, category)]

which gives:

   id category N
1:  1        x 3
2:  2        y 3
3:  2        x 2
4:  1        y 2
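
Since the question mentions keeping a logical vector of NA positions, the same count can be sketched with that vector as the row filter. This is only an illustration: the name na_pos is hypothetical (the question does not name its vector) and it is rebuilt here from the toy data.

```r
library(data.table)

# Toy data from the question; rep_len() makes the recycling explicit.
dt <- data.table(id = rep(1:2, 7), category = rep_len(c("x", "y", NA), 14))

# Hypothetical stand-in for the logical vector of NA positions kept elsewhere.
na_pos <- is.na(dt$category)

# Filter with the precomputed vector instead of calling is.na() again,
# then count rows per (id, category) pair with .N.
counts <- dt[!na_pos, .N, by = .(id, category)]
print(counts)
```

The vector has to have one entry per row of dt and stay in the same row order, otherwise the filter drops the wrong rows.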

Ordering this by id will give you a clearer picture:

 dt[!is.na(category), .N, by = .(id, category)][order(id)]

which results in:

   id category N
1:  1        x 3
2:  1        y 2
3:  2        y 3
4:  2        x 2

If you only want the row with the most frequent category for each id:

 dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]

or:

 dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]

which both give:

   id category N
1:  1        x 3
2:  2        y 3
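
Two variations may be worth considering, sketched here on the same toy data rather than taken from the answer above. The first uses base R's table(), which drops NA by default, to get one category name per id in a single step. The second keeps every category tied for the highest count, since head(.SD,1) and which.max(N) both return only one row per id when counts are tied. The names modal, counts and modal_ties are illustrative.

```r
library(data.table)

# Toy data from the question; rep_len() makes the recycling explicit.
dt <- data.table(id = rep(1:2, 7), category = rep_len(c("x", "y", NA), 14))

# Variation 1: table() ignores NA by default and which.max() returns the
# first maximum, so exactly one category name comes back per id.
modal <- dt[, .(category = names(which.max(table(category)))), by = id]
print(modal)

# Variation 2: keep all categories that share the highest count within an id.
counts <- dt[!is.na(category), .N, by = .(id, category)]
modal_ties <- counts[, .SD[N == max(N)], by = id]
print(modal_ties)
```

On the toy data there are no ties, so both variations report x for id 1 and y for id 2. An id whose categories are all NA is not handled by the first variation and would need separate treatment.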


Source: https://stackoverflow.com/questions/34403017/concise-r-data-table-syntax-for-modal-value-most-frequent-by-group
