Question
I updated to the latest version of data.table, 1.9.4, from a fairly recent prior version (1.8.x, I think), and now I'm getting some unexpected behavior.
library(data.table)
set.seed(12312014)
# a vector of letters a:e, each repeated between 1 and 10 times
type <- unlist(mapply(rep, letters[1:5], round(runif(5, 1, 10), 0)))
# a random vector of 3 categories
category <- sample(c('small', 'med', 'large'), length(type), replace=TRUE)
my_dt <- data.table(type, category)
Say I want the proportion of each category within each type. I used to compute it like this:
my_dt[, type_n:=.N, by=type]
my_dt[, .N/type_n, by=.(type, category)][order(type, category)]
What I get with data.table 1.9.4:
# type category V1
# 1: a large 0.2500000
# 2: a large 0.2500000
# 3: a med 0.2500000
# 4: a med 0.2500000
# 5: a small 0.5000000
# 6: a small 0.5000000
# 7: a small 0.5000000
# 8: a small 0.5000000
# 9: b large 0.4285714
# 10: b large 0.4285714
# 11: b large 0.4285714
# 12: b med 0.4285714
# (...and so on, 42 rows long)
But what I used to get, I'm virtually certain, was this (the simple proportion of category by type):
# type category V1
# 1: a large 0.2500000
# 2: a med 0.2500000
# 3: a small 0.5000000
# 4: b large 0.4285714
# 5: b med 0.4285714
# 6: b small 0.1428571
# 7: c large 0.3000000
# 8: c med 0.1000000
# 9: c small 0.6000000
# 10: d large 0.2222222
# 11: d med 0.6666667
# 12: d small 0.1111111
# 13: e large 0.3750000
# 14: e med 0.3750000
# 15: e small 0.2500000
I can get the desired result with this:
unique(my_dt[, .N/type_n, by=.(type, category)][order(type, category)])
...but I wondered if there's a preferred way in the new data.table syntax. I know I can also just use prop.table, but I want the result in long format.
prop.table(table(my_dt), margin=1)
# category
# type large med small
# a 0.2500000 0.2500000 0.5000000
# b 0.4285714 0.4285714 0.1428571
# c 0.3000000 0.1000000 0.6000000
# d 0.2222222 0.6666667 0.1111111
# e 0.3750000 0.3750000 0.2500000
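For what it's worth, the wide prop.table result can be reshaped to long format with base R's as.data.frame method for tables. The following is only a minimal sketch: toy_dt is hypothetical toy data (not the random sample above), and the column name prop is an arbitrary choice.

```r
library(data.table)

# Hypothetical toy data, small enough to check by hand
toy_dt <- data.table(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "large")
)

# Wide table of row proportions, as in the prop.table call above
wide <- prop.table(table(toy_dt), margin = 1)

# as.data.frame() on a table gives one row per cell (long format);
# responseName sets the name of the value column
long <- as.data.table(as.data.frame(wide, responseName = "prop"))
long[order(type, category)]
```

Note that this keeps every cell of the table, so a (type, category) combination that never occurs shows up with a proportion of 0, whereas grouping with by= only returns combinations that are present.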
For reference, my sessionInfo() call gives:
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_1.0.0 data.table_1.9.4
loaded via a namespace (and not attached):
[1] chron_2.3-45 colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2 labeling_0.2
[7] MASS_7.3-33 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.1
Answer 1:
You could count each (type, category) pair first and then divide by the total within each type:
my_dt[, .N, by=.(type,category)][, prop:=N/sum(N), by=type][]
type category N prop
1: a small 4 0.5000000
2: a med 2 0.2500000
3: a large 2 0.2500000
4: b med 3 0.4285714
5: b large 3 0.4285714
6: b small 1 0.1428571
7: c large 3 0.3000000
8: c small 6 0.6000000
9: c med 1 0.1000000
10: d med 6 0.6666667
11: d large 2 0.2222222
12: d small 1 0.1111111
13: e small 2 0.2500000
14: e med 3 0.3750000
15: e large 3 0.3750000
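As a hedged illustration of why the question's expression repeats rows: inside each by group, a column such as type_n is a vector with one element per row of the group, so .N/type_n has length .N and data.table returns one row per element. The sketch below uses hypothetical toy data (toy_dt, not the question's random sample) and compares three variants; the type_n[1] variant is just one possible way to collapse the vector and is not taken from the original post.

```r
library(data.table)

# Hypothetical toy data, small enough to check by hand
toy_dt <- data.table(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "large")
)
toy_dt[, type_n := .N, by = type]

# .N is a single number but type_n is a vector of length .N within the group,
# so the result has .N rows per (type, category) group
repeated <- toy_dt[, .N / type_n, by = .(type, category)]

# Using only the first element of type_n gives one row per group
collapsed <- toy_dt[, .(prop = .N / type_n[1]), by = .(type, category)]

# The count-then-divide chain from the answer above
counted <- toy_dt[, .N, by = .(type, category)][, prop := N / sum(N), by = type]

nrow(repeated)   # as many rows as toy_dt
collapsed
counted
```

On this toy data the last two variants give the same proportions; the chained version has the small advantage of not needing the helper column type_n at all.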
Source: https://stackoverflow.com/questions/27715846/new-behavior-in-data-table-n-something-with-by-calculate-proportion