Binning data in R

半腔热情 提交于 2019-11-29 02:32:54

Use cut and tapply:

> tapply(v, cut(v, 60), median)
          (-3,67.7]          (67.7,134]           (134,201]           (201,268] 
               34.0               101.0               167.5               234.0 
          (268,334]           (334,401]           (401,468]           (468,534] 
              301.0               367.5               434.0               501.0 
          (534,601]           (601,668]           (668,734]           (734,801] 
              567.5               634.0               701.0               767.5 
          (801,867]           (867,934]         (934,1e+03]    (1e+03,1.07e+03] 
              834.0               901.0               967.5              1034.0 
(1.07e+03,1.13e+03]  (1.13e+03,1.2e+03]  (1.2e+03,1.27e+03] (1.27e+03,1.33e+03] 
             1101.0              1167.5              1234.0              1301.0 
 (1.33e+03,1.4e+03]  (1.4e+03,1.47e+03] (1.47e+03,1.53e+03]  (1.53e+03,1.6e+03] 
             1367.5              1434.0              1500.5              1567.0 
 (1.6e+03,1.67e+03] (1.67e+03,1.73e+03]  (1.73e+03,1.8e+03]  (1.8e+03,1.87e+03] 
             1634.0              1700.5              1767.0              1834.0 
(1.87e+03,1.93e+03]    (1.93e+03,2e+03]    (2e+03,2.07e+03] (2.07e+03,2.13e+03] 
             1900.5              1967.0              2034.0              2100.5 
 (2.13e+03,2.2e+03]  (2.2e+03,2.27e+03] (2.27e+03,2.33e+03]  (2.33e+03,2.4e+03] 
             2167.0              2234.0              2300.5              2367.0 
 (2.4e+03,2.47e+03] (2.47e+03,2.53e+03]  (2.53e+03,2.6e+03]  (2.6e+03,2.67e+03] 
             2434.0              2500.5              2567.0              2634.0 
(2.67e+03,2.73e+03]  (2.73e+03,2.8e+03]  (2.8e+03,2.87e+03] (2.87e+03,2.93e+03] 
             2700.5              2767.0              2833.5              2900.0 
   (2.93e+03,3e+03]    (3e+03,3.07e+03] (3.07e+03,3.13e+03]  (3.13e+03,3.2e+03] 
             2967.0              3033.5              3100.0              3167.0 
 (3.2e+03,3.27e+03] (3.27e+03,3.33e+03]  (3.33e+03,3.4e+03]  (3.4e+03,3.47e+03] 
             3233.5              3300.0              3367.0              3433.5 
(3.47e+03,3.53e+03]  (3.53e+03,3.6e+03]  (3.6e+03,3.67e+03] (3.67e+03,3.73e+03] 
             3500.0              3567.0              3633.5              3700.0 
 (3.73e+03,3.8e+03]  (3.8e+03,3.87e+03] (3.87e+03,3.93e+03]    (3.93e+03,4e+03] 
             3767.0              3833.5              3900.0              3967.0

In the past, i've used this function

evenbins <- function(x, bin.count=10, order=T) {
    bin.size <- rep(length(x) %/% bin.count, bin.count)
    bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
    bin <- rep(1:bin.count, bin.size)
    if(order) {    
        bin <- bin[rank(x,ties.method="random")]
    }
    return(factor(bin, levels=1:bin.count, ordered=order))
}

and then i can run it with

v.bin <- evenbins(v, 60)

and check the sizes with

table(v.bin)

and see they all contain 66 or 67 elements. By default this will order the values just like cut will so each of the factor levels will have increasing values. If you want to bin them based on their original order,

v.bin <- evenbins(v, 60, order=F)

instead. This just split the data up in the order it appears

This result shows the 59 median values of the break-points. The 60 bin values are probably as close to equal as possible (but probably not exactly equal).

> sq <- seq(1, 4000, length = 60)
> sapply(2:length(sq), function(i) median(c(sq[i-1], sq[i])))
# [1]   34.88983  102.66949  170.44915  238.22881  306.00847  373.78814
# [7]  441.56780  509.34746  577.12712  644.90678  712.68644  780.46610
#  ......

Actually, after checking, the bins are pretty darn close to being equal.

> unique(diff(sq))
# [1] 67.77966 67.77966 67.77966
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!