Binning data in R

核能气质少年 submitted on 2019-11-27 17:04:21

Question


I have a vector with around 4000 values. I need to bin it into 60 equal intervals and then calculate the median of each bin.

v <- 1:4000

v is really just a vector. I read about cut, but that requires me to specify the breakpoints. I just want 60 equal intervals.


Answer 1:


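Use cut and tapply. Passing a single number as the breaks argument of cut divides the range of v into that many equal-width intervals: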

> tapply(v, cut(v, 60), median)
          (-3,67.7]          (67.7,134]           (134,201]           (201,268] 
               34.0               101.0               167.5               234.0 
          (268,334]           (334,401]           (401,468]           (468,534] 
              301.0               367.5               434.0               501.0 
          (534,601]           (601,668]           (668,734]           (734,801] 
              567.5               634.0               701.0               767.5 
          (801,867]           (867,934]         (934,1e+03]    (1e+03,1.07e+03] 
              834.0               901.0               967.5              1034.0 
(1.07e+03,1.13e+03]  (1.13e+03,1.2e+03]  (1.2e+03,1.27e+03] (1.27e+03,1.33e+03] 
             1101.0              1167.5              1234.0              1301.0 
 (1.33e+03,1.4e+03]  (1.4e+03,1.47e+03] (1.47e+03,1.53e+03]  (1.53e+03,1.6e+03] 
             1367.5              1434.0              1500.5              1567.0 
 (1.6e+03,1.67e+03] (1.67e+03,1.73e+03]  (1.73e+03,1.8e+03]  (1.8e+03,1.87e+03] 
             1634.0              1700.5              1767.0              1834.0 
(1.87e+03,1.93e+03]    (1.93e+03,2e+03]    (2e+03,2.07e+03] (2.07e+03,2.13e+03] 
             1900.5              1967.0              2034.0              2100.5 
 (2.13e+03,2.2e+03]  (2.2e+03,2.27e+03] (2.27e+03,2.33e+03]  (2.33e+03,2.4e+03] 
             2167.0              2234.0              2300.5              2367.0 
 (2.4e+03,2.47e+03] (2.47e+03,2.53e+03]  (2.53e+03,2.6e+03]  (2.6e+03,2.67e+03] 
             2434.0              2500.5              2567.0              2634.0 
(2.67e+03,2.73e+03]  (2.73e+03,2.8e+03]  (2.8e+03,2.87e+03] (2.87e+03,2.93e+03] 
             2700.5              2767.0              2833.5              2900.0 
   (2.93e+03,3e+03]    (3e+03,3.07e+03] (3.07e+03,3.13e+03]  (3.13e+03,3.2e+03] 
             2967.0              3033.5              3100.0              3167.0 
 (3.2e+03,3.27e+03] (3.27e+03,3.33e+03]  (3.33e+03,3.4e+03]  (3.4e+03,3.47e+03] 
             3233.5              3300.0              3367.0              3433.5 
(3.47e+03,3.53e+03]  (3.53e+03,3.6e+03]  (3.6e+03,3.67e+03] (3.67e+03,3.73e+03] 
             3500.0              3567.0              3633.5              3700.0 
 (3.73e+03,3.8e+03]  (3.8e+03,3.87e+03] (3.87e+03,3.93e+03]    (3.93e+03,4e+03] 
             3767.0              3833.5              3900.0              3967.0
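
The labels in scientific notation above, such as (1e+03,1.07e+03], come from cut's default of three significant digits (dig.lab = 3), and the first interval starts at -3 because cut widens the range slightly at both ends. As a minimal sketch, assuming the same v as in the question, raising dig.lab should give more readable labels without changing the medians:

```r
v <- 1:4000

# cut() with a single number splits range(v) into that many equal-width bins;
# dig.lab = 4 asks for up to 4 significant digits in the interval labels.
bins <- cut(v, breaks = 60, dig.lab = 4)

# Median of the values that fall into each bin.
med <- tapply(v, bins, median)

head(med)       # first few bin medians, named by interval
nlevels(bins)   # number of bins
```

Passing labels = FALSE to cut instead returns plain integer bin codes, which can be handier when the bins are only used for grouping.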



Answer 2:


In the past, I've used this function:

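evenbins <- function(x, bin.count = 10, order = TRUE) {
    # Base size of each bin, then spread the remainder over the first bins
    bin.size <- rep(length(x) %/% bin.count, bin.count)
    bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
    # Bin labels in sequence: bin.size[1] ones, bin.size[2] twos, ...
    bin <- rep(1:bin.count, bin.size)
    if (order) {
        # Assign bins by rank so that smaller values land in lower bins
        bin <- bin[rank(x, ties.method = "random")]
    }
    return(factor(bin, levels = 1:bin.count, ordered = order))
}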

and then I can run it with

v.bin <- evenbins(v, 60)

and check the sizes with

table(v.bin)

and see they all contain 66 or 67 elements. By default this orders the values, just as cut does, so each successive factor level holds larger values. If you want to bin them based on their original order, use

v.bin <- evenbins(v, 60, order = FALSE)

instead. This just splits the data up in the order in which it appears.
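
To get from equal-count bins to the per-bin medians the question asks for, the bin assignment can be fed to tapply. The following is a rough sketch of a different, rank-based way to form equal-count bins, not the evenbins function above; the names n.bins and bin are chosen here for illustration only:

```r
v <- 1:4000
n.bins <- 60

# Rank each value, then scale the ranks onto 1..n.bins.
# ties.method = "first" keeps the ranks distinct so bin sizes stay balanced.
bin <- ceiling(rank(v, ties.method = "first") * n.bins / length(v))

table(bin)                      # bin sizes
med <- tapply(v, bin, median)   # median per equal-count bin
head(med)
```

For this v the bins hold 66 or 67 elements each, the same sizes that table(v.bin) reports for evenbins.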




Answer 3:


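This result shows the 59 midpoints between adjacent break points (the median of two values is their midpoint). Note that 60 break points delimit 59 intervals; for 60 bins, seq would need length = 61. The bins are probably as close to equal width as possible (but probably not exactly equal).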

> sq <- seq(1, 4000, length = 60)
> sapply(2:length(sq), function(i) median(c(sq[i-1], sq[i])))
# [1]   34.88983  102.66949  170.44915  238.22881  306.00847  373.78814
# [7]  441.56780  509.34746  577.12712  644.90678  712.68644  780.46610
#  ......

Actually, after checking, the bin widths are pretty darn close to being equal; unique returns more than one value only because of floating-point rounding.

> unique(diff(sq))
# [1] 67.77966 67.77966 67.77966
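
The break points above can also be handed to cut to bin the data itself, rather than taking midpoints of the breaks. A minimal sketch, assuming 61 break points are used so that they delimit 60 intervals, and reusing the name sq from this answer:

```r
v <- 1:4000

# 61 equally spaced break points delimit 60 intervals over [1, 4000].
sq <- seq(min(v), max(v), length.out = 61)

# include.lowest = TRUE keeps min(v) in the first bin, which would
# otherwise be dropped because the intervals are open on the left.
bins <- cut(v, breaks = sq, include.lowest = TRUE)

med <- tapply(v, bins, median)   # median of the data in each bin
length(med)
head(med)
```

Unlike cut(v, 60), explicit breaks are used exactly as given, so the first interval starts at min(v) instead of slightly below it.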


Source: https://stackoverflow.com/questions/24359863/binning-data-in-r
