New behavior in data.table? .N / something with `by` (calculate proportion)

Question

I updated data.table to the latest version, 1.9.4, from a moderately recent earlier version (1.8.x, I think), and now I'm seeing some unexpected behavior.

library(data.table)

set.seed(12312014)

# a vector of letters a:e, each repeated between 1 and 10 times
type <- unlist(mapply(rep, letters[1:5], round(runif(5, 1, 10), 0)))

# a random vector of 3 categories
category <- sample(c('small', 'med', 'large'), length(type), replace=TRUE)
my_dt <- data.table(type, category)

Say I want the proportion of each category within each type. I used to compute it like this:

my_dt[, type_n:=.N, by=type]
my_dt[, .N/type_n, by=.(type, category)][order(type, category)]

What I get with data.table 1.9.4:

#     type category        V1
#  1:    a    large 0.2500000
#  2:    a    large 0.2500000
#  3:    a      med 0.2500000
#  4:    a      med 0.2500000
#  5:    a    small 0.5000000
#  6:    a    small 0.5000000
#  7:    a    small 0.5000000
#  8:    a    small 0.5000000
#  9:    b    large 0.4285714
# 10:    b    large 0.4285714
# 11:    b    large 0.4285714
# 12:    b      med 0.4285714
# (...and so on, 42 rows long)

But what I used to get, I'm virtually certain, was this (the simple proportion of each category within each type):

#     type category        V1
#  1:    a    large 0.2500000
#  2:    a      med 0.2500000
#  3:    a    small 0.5000000
#  4:    b    large 0.4285714
#  5:    b      med 0.4285714
#  6:    b    small 0.1428571
#  7:    c    large 0.3000000
#  8:    c      med 0.1000000
#  9:    c    small 0.6000000
# 10:    d    large 0.2222222
# 11:    d      med 0.6666667
# 12:    d    small 0.1111111
# 13:    e    large 0.3750000
# 14:    e      med 0.3750000
# 15:    e    small 0.2500000

I can get the desired result with this:

unique(my_dt[, .N/type_n, by=.(type, category)][order(type, category)])

...but I wondered whether there's a preferred way to do this in the new data.table syntax. I know I can also just use prop.table, but I want the result in long format.

prop.table(table(my_dt), margin=1)
#     category
# type     large       med     small
#    a 0.2500000 0.2500000 0.5000000
#    b 0.4285714 0.4285714 0.1428571
#    c 0.3000000 0.1000000 0.6000000
#    d 0.2222222 0.6666667 0.1111111
#    e 0.3750000 0.3750000 0.2500000
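
As an aside, the wide prop.table result can itself be reshaped to long format with base R's as.data.frame() method for tables. The following is only a minimal sketch on a small hand-made data frame (`toy`, `wide` and `long` are hypothetical names, not the random sample above). One difference from a grouped data.table approach, under this assumption: every type/category cell is returned, including combinations that never occur (with proportion 0).

```r
# Hypothetical toy data (not the question's random sample)
toy <- data.frame(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "med")
)

# Row-wise proportions in wide form, as in the question
wide <- prop.table(table(toy), margin = 1)

# as.data.frame() on a table returns one row per cell (long format);
# responseName sets the name of the value column
long <- as.data.frame(wide, responseName = "prop")
```

Here `long` has one row for each of the 2 x 3 cells, with columns type, category and prop.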

For reference, sessionInfo() gives:

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_1.0.0    data.table_1.9.4

loaded via a namespace (and not attached):
 [1] chron_2.3-45     colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2     labeling_0.2    
 [7] MASS_7.3-33      munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4    
[13] scales_0.2.4     stringr_0.6.2    tools_3.1.1

Answer 1:

You could try counting per (type, category) first and then dividing by the per-type total:

my_dt[, .N, by=.(type,category)][, prop:=N/sum(N), by=type][]

    type category N      prop
 1:    a    small 4 0.5000000
 2:    a      med 2 0.2500000
 3:    a    large 2 0.2500000
 4:    b      med 3 0.4285714
 5:    b    large 3 0.4285714
 6:    b    small 1 0.1428571
 7:    c    large 3 0.3000000
 8:    c    small 6 0.6000000
 9:    c      med 1 0.1000000
10:    d      med 6 0.6666667
11:    d    large 2 0.2222222
12:    d    small 1 0.1111111
13:    e    small 2 0.2500000
14:    e      med 3 0.3750000
15:    e    large 3 0.3750000
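
A possible explanation for the duplicated rows in the question: inside `j`, `type_n` is a column, so it has one element per row of the group, while `.N` is a single number. `.N/type_n` is therefore a vector of length `.N`, and data.table returns one row per element. The sketch below illustrates this on a small hand-made table (`toy`, `dup`, `one` and `two` are hypothetical names, not from the question) and shows that taking only the first element of `type_n` gives one row per group, matching the two-step approach above.

```r
library(data.table)

# Hypothetical toy data: 4 rows of type "a", 2 rows of type "b"
toy <- data.table(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "med")
)

# Group size per type, stored as a column (as in the question)
toy[, type_n := .N, by = type]

# .N has length 1 but type_n has one element per row in the group,
# so the result has as many rows as the original table
dup <- toy[, .N / type_n, by = .(type, category)]

# Taking the first element of type_n yields one row per group
one <- toy[, .(prop = .N / type_n[1L]), by = .(type, category)]

# The two-step approach from the answer, which needs no helper column
two <- toy[, .N, by = .(type, category)][, prop := N / sum(N), by = type]
```

In this sketch `dup` has 6 rows while `one` and `two` have 4 each, with identical proportions. Using `keyby` instead of `by` in the first step would additionally return the groups sorted by type and category.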

Source: https://stackoverflow.com/questions/27715846/new-behavior-in-data-table-n-something-with-by-calculate-proportion

Tags

data.table