Find top deciles from dataframe by group

花落未央 2021-01-23 07:59

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this

3 answers
  • 2021-01-23 08:14

    Stick to your Stata instincts and use a single data set:

    library(data.table)
    DT <- data.table(data)

    # r: within-group rank of v2, scaled to (0, 1] by the group size .N
    DT[, r := rank(v2) / .N, by = v1]
    

    You can see the result by typing DT.
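
    To make the scaling concrete, here is a minimal sketch on made-up data (the table toy and its values are hypothetical, not the OP's data). The largest v2 in a group gets r = 1 and the smallest gets 1/.N. Note that rank() averages tied values by default, so tied rows share a fractional rank.

    ```r
    library(data.table)

    # Hypothetical toy data: a single group of five customers.
    toy <- data.table(custID = 1:5, v1 = "A", v2 = c(12, 30, 18, 25, 21))

    # rank() is 1 for the smallest v2 and .N (the group size) for the largest,
    # so dividing by .N maps every row to a fraction in (0, 1].
    toy[, r := rank(v2) / .N, by = v1]
    toy
    ```

    With these values, r > .8 selects only the customer with v2 = 30, since the next one sits exactly at 0.8.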


    From here, you can bin the within-v1 rank, r, into groups if you want to. Following Stata idioms:

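    DT[, g := {
      x <- rep(0, .N)   # default: not in the top 20%
      x[r > .8] <- 20   # top 20% of the group
      x[r > .9] <- 10   # top 10% (overwrites the 20 set above)
      x
    }]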
    

    This is like gen and then two replace ... if statements. Again, you can see the result with DT.
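
    If the braces get in the way, the same gen/replace pattern can probably be written as three separate statements, each one mapping to a single Stata command. This is only a sketch on hypothetical data (toy is made up for illustration):

    ```r
    library(data.table)

    # Hypothetical toy data: ten customers in one group, customer 1 has the largest v2.
    toy <- data.table(custID = 1:10, v1 = "A", v2 = 30:21)
    toy[, r := rank(v2) / .N, by = v1]

    toy[, g := 0]           # gen g = 0
    toy[r > .8, g := 20]    # replace g = 20 if r > .8
    toy[r > .9, g := 10]    # replace g = 10 if r > .9

    toy[g > 0]              # customers 1 (g = 10) and 2 (g = 20)
    ```

    The order of the two replacements matters: the r > .9 rows are a subset of the r > .8 rows, so they have to be overwritten last.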


    Finally, you can subset with

    DT[g>0]
    

    which gives

       custID v1 v2     r  g
    1:      1  A 30 1.000 10
    2:      2  A 29 0.900 20
    3:      1  B 20 0.975 10
    4:      2  B 19 0.875 20
    5:      6  B 20 0.975 10
    6:      7  B 19 0.875 20
    

    These steps can also be chained together:

    DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
    

    (Thanks to @ExperimenteR:)

    To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:

    dcast(
      DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0], 
      custID~v1, value.var="g")  # name the value column instead of letting dcast guess it
    

    At the time of writing, calling dcast on a data.table requires the latest version of data.table, available (I think) from GitHub.
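
    As a rough illustration of what the reshape does, here is a sketch on hypothetical data with groups of different sizes (toy and wide are made-up names). Customers who make the cut in only one group get NA in the other column.

    ```r
    library(data.table)

    # Hypothetical toy data: 10 customers in group A, 20 in group B.
    toy <- data.table(
      custID = c(1:10, 1:20),
      v1     = rep(c("A", "B"), times = c(10, 20)),
      v2     = c(30:21, 40:21)
    )

    toy[, r := rank(v2) / .N, by = v1]
    toy[, g := 0][r > .8, g := 20][r > .9, g := 10]

    # Long to wide: one row per custID, one column per value of v1.
    wide <- dcast(toy[g > 0], custID ~ v1, value.var = "g")
    wide
    ```

    Here customers 3 and 4 are in the top 20% of B only, so their A entries come out as NA.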

  • 2021-01-23 08:19

    The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.

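    lapply(split(data, data$v1), function(df) {
        # 80th and 90th percentiles of v2 within this v1 group
        cutoff <- quantile(df$v2, c(0.8, 0.9))
        # 10 = above the 90th percentile, 20 = above the 80th, NA = neither
        top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
        # keep only the customers that landed in a top bucket
        na.omit(data.frame(id=df$custID, top_pct))
    })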
    

    Finding quantiles is done with quantile.
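
    For a feel of what the cutoffs look like, here is a small sketch on a hypothetical vector (v2 below is made up, not the OP's column). With the default settings quantile interpolates between observations, so the cutoffs need not be values that occur in the data.

    ```r
    # Hypothetical values: ten observations, 1 through 10.
    v2 <- 1:10

    cutoff <- quantile(v2, c(0.8, 0.9))
    cutoff
    # 80% 90%
    # 8.2 9.1

    v2[v2 > cutoff[2]]                       # top 10%: 10
    v2[v2 > cutoff[1] & v2 <= cutoff[2]]     # next 10%: 9
    ```

    The lapply call returns a named list with one data frame per value of v1, so the per-group results are looked up by group name.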

  • 2021-01-23 08:37

    You don't need the function pf to achieve what you want. Try a dplyr/tidyr combo:

    library(dplyr)
    library(tidyr)
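    data %>%
        group_by(v1) %>%
        arrange(desc(v2)) %>%                        # largest v2 first
        mutate(n = n()) %>%                          # group size
        filter(row_number() <= round(n * .2)) %>%    # keep the top 20% of each group
        mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
        select(custID, top_pct) %>%                  # the grouping column v1 is kept automatically
        spread(v1, top_pct)                          # one column per value of v1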
    #  custID  A  B
    #1      1 10 10
    #2      2 20 20
    #3      6 NA 10
    #4      7 NA 20
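
    On recent versions of tidyr, spread() is superseded by pivot_wider(), and current dplyr's arrange() ignores grouping unless .by_group = TRUE is set. A sketch of the same pipeline under those assumptions, run on hypothetical data (toy and wide are made-up names, not the OP's data), might look like this:

    ```r
    library(dplyr)
    library(tidyr)

    # Hypothetical toy data: 10 customers in group A, 20 in group B.
    toy <- data.frame(
      custID = c(1:10, 1:20),
      v1     = rep(c("A", "B"), times = c(10, 20)),
      v2     = c(30:21, 40:21)
    )

    wide <- toy %>%
      group_by(v1) %>%
      arrange(desc(v2), .by_group = TRUE) %>%    # sort within each group
      mutate(n = n()) %>%
      filter(row_number() <= round(n * .2)) %>%  # top 20% of each group
      mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
      ungroup() %>%
      select(custID, v1, top_pct) %>%
      pivot_wider(names_from = v1, values_from = top_pct)

    wide
    ```

    Dropping the grouping with ungroup() before select() keeps the column selection explicit rather than relying on the grouping column being carried along.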
    