Find top deciles from dataframe by group

花落未央 2021-01-23 07:59

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this

3 answers
  •  心在旅途
    2021-01-23 08:14

    Stick to your Stata instincts and use a single data set:

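    library(data.table)
    DT <- data.table(data)  # `data` is your original data frame
    
    # r: rank of v2 within each v1 group, scaled to (0, 1] by the group size .N
    DT[,r:=rank(v2)/.N,by=v1]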
    

    You can see the result by typing DT.
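
    To try this step without the original data, here is a minimal sketch on a made-up data set. The custID, v1 and v2 values below are hypothetical, not the OP's; they are only chosen so that one group has ties.

    ```r
    library(data.table)

    # Hypothetical data: the question's data set is not shown, so these values are made up.
    DT <- data.table(
      custID = c(1:5, 1:5),
      v1     = rep(c("A", "B"), each = 5),
      v2     = c(10, 20, 30, 40, 50, 5, 5, 7, 9, 9)
    )

    # rank() is evaluated separately within each v1 group; .N is the group size,
    # so r lies in (0, 1]. Tied v2 values share their average rank.
    DT[, r := rank(v2) / .N, by = v1]
    print(DT)
    ```

    In group B the two tied top values both get r = 0.9 rather than 0.8 and 1.0, which matters for the strict r > .9 cutoff used later.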


    From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...

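    DT[,g:={
      x = rep(0,.N)    # default: outside the top two deciles
      x[r>.8] = 20     # top 20% within the v1 group
      x[r>.9] = 10     # top 10%, overwrites the 20 set above
      x
    }]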
    

    This is like gen and then two replace ... if statements. Again, you can see the result with DT.
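
    If you would rather have a single vectorised expression than gen/replace-style assignments, one possible equivalent uses base R's cut(). This is only a sketch on made-up r values, not part of the approach above:

    ```r
    library(data.table)

    # Hypothetical within-group rank fractions, standing in for the r column.
    DT <- data.table(r = c(0.50, 0.80, 0.85, 0.90, 0.95, 1.00))

    # cut() intervals are right-closed by default:
    #   (-Inf, 0.8] -> 0,  (0.8, 0.9] -> 20,  (0.9, Inf] -> 10
    # which matches the strict comparisons r > .8 and r > .9.
    DT[, g := c(0, 20, 10)[cut(r, breaks = c(-Inf, 0.8, 0.9, Inf), labels = FALSE)]]
    print(DT)
    ```

    With labels = FALSE, cut() returns the interval index (1, 2 or 3), which is then used to look up the code for that bin.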


    Finally, you can subset with

    DT[g>0]
    

    which gives

       custID v1 v2     r  g
    1:      1  A 30 1.000 10
    2:      2  A 29 0.900 20
    3:      1  B 20 0.975 10
    4:      2  B 19 0.875 20
    5:      6  B 20 0.975 10
    6:      7  B 19 0.875 20
    

    These steps can also be chained together:

    DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
    

    (Thanks to @ExperimenteR:)

    To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:

    dcast(
      DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0], 
      custID~v1)
    

    At the time of writing, using dcast on a data.table required the latest version of data.table, available (I think) from GitHub.
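
    For reference, here is a small sketch of the reshape on its own, using a hand-typed table shaped like the DT[g>0] output above. The name top is hypothetical, and value.var = "g" is spelled out as an assumption about which column should fill the cells (without it, dcast guesses):

    ```r
    library(data.table)

    # Hypothetical subset, typed in by hand to mirror the DT[g>0] result.
    top <- data.table(
      custID = c(1, 2, 1, 2, 6, 7),
      v1     = c("A", "A", "B", "B", "B", "B"),
      g      = c(10, 20, 10, 20, 10, 20)
    )

    # One row per custID, one column per v1 value; value.var names the column
    # whose values fill the cells. Missing custID/v1 combinations become NA.
    wide <- dcast(top, custID ~ v1, value.var = "g")
    print(wide)
    ```

    Customers 6 and 7 only appear in group B here, so their A column is NA.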
